
Going Beyond Linear RL: Sample Efficient Neural Function Approximation

Baihe Huang (Peking University), Kaixuan Huang (Princeton University), Sham M. Kakade (University of Washington), Jason D. Lee (Princeton University), Qi Lei (Princeton University), Runzhe Wang (Princeton University), Jiaqi Yang (Tsinghua University)
Abstract

Deep Reinforcement Learning (RL) powered by neural net approximation of the Q function has had enormous empirical success. While the theory of RL has traditionally focused on linear function approximation (or eluder dimension) approaches, little is known about nonlinear RL with neural net approximations of the Q functions. This is the focus of this work, where we study function approximation with two-layer neural networks (considering both ReLU and polynomial activation functions). Our first result is a computationally and statistically efficient algorithm in the generative model setting under completeness for two-layer neural networks. Our second result considers this setting but under only realizability of the neural net function class. Here, assuming deterministic dynamics, the sample complexity scales linearly in the algebraic dimension. In all cases, our results significantly improve upon what can be attained with linear (or eluder dimension) methods.

1 Introduction

In reinforcement learning (RL), an agent aims to learn the optimal decision-making rule by interacting with an unknown environment [SB18]. Deep Reinforcement Learning, empowered by deep neural networks [LBH15, GBC16], has achieved tremendous success in various real-world applications, such as Go [SHM+16], Atari [MKS+13a], Dota 2 [BBC+19], Texas Hold'em poker [MSB+17], and autonomous driving [SSSS16]. These modern RL applications are characterized by large state-action spaces, and the empirical success of deep RL corroborates the observation that deep neural networks can extrapolate well across state-action spaces [HIB+18, MKS+13b, LHP+15].

Although non-linear function approximation schemes are prevalent in practice, theoretical understanding of the sample complexity of RL has mainly focused on tabular or linear function approximation settings [SLW+06, JOA10, AOM17, JAZBJ18, Rus19, ZB19, AYLSW19, JYWJ20a, JYWJ20b, WWDK21]. These results rely on a finite state space or exact linear approximations. Recently, sample-efficient algorithms under non-linear function approximation settings have been proposed [WVR17, DJK+18, DKJ+19, DPWZ20, LCYW19, WCYW20, DYM21, ZLG20, YJW+20]. These algorithms are based on Bellman rank [JKA+17], eluder dimension [RVR13b], neural tangent kernels [JGH18, AZLS19, DLL+19, ZCZG18], or sequential Rademacher complexity [RST15a, RST15b]. However, the understanding of how deep RL learns and generalizes in large state spaces is far from complete. Whereas the aforementioned works study function approximation structures that enjoy the nice properties of linear models, such as low information gain and low eluder dimension, the highly non-linear nature of neural networks makes it challenging to apply existing analyses to deep RL. For one thing, recent results in deep learning theory cast doubt on the ability of neural tangent kernels and random features to model actual neural networks: the neural tangent kernel approximately reduces a neural network to a random feature model, but the RKHS norm required to approximate a neural network this way is exponential [YS19]. Moreover, it remains unclear which neural network models have low eluder dimension; recent works [DYM21, LKFS21] show that two-layer neural networks have eluder dimension exponential in the feature dimension. Furthermore, [MPSL21] demonstrates hard instances where RL is exponentially hard to learn even when the target value function can be approximated by a neural network and the optimal policy is softmax linear. Thus, the mismatch between the empirical success of deep RL and its theoretical understanding remains significant, which raises the following important question:

What are the structural properties that allow sample-efficient algorithms for RL with neural network function approximation?

In this paper, we advance the understanding of the above question by exhibiting several structural properties that allow efficient RL algorithms with neural function approximation. We consider several value function approximation models that possess high information gain and high eluder dimension. Specifically, we study two structures, namely two-layer neural networks and structured polynomials (i.e., two-layer neural networks with polynomial activation functions), under two RL settings, namely RL with a simulator model and online RL. In the simulator (generative model) setting [Kak03, SWW+18], the agent can simulate the MDP at any state-action pair. In online RL, the agent can only start at an initial state and interact with the MDP step by step. The goal in both settings is to find a near-optimal policy while minimizing the number of samples used.

We obtain the following results. For the simulator setting, we propose sample-efficient algorithms for RL with two-layer neural network function approximation. Under either policy completeness, Bellman completeness, or a gap condition, our method provably learns a near-optimal policy with polynomially many samples. For online RL, we provide sample-efficient algorithms for RL with structured polynomial function approximation. When the transition is deterministic, we also present sample-efficient algorithms under only the realizability assumption [DKWY20a, WWK21]. Our main techniques are based on neural network recovery [ZSJ+17, JSA15, GLM18a] and algebraic geometry [Mil17, Sha13, BCR13, WX19].

1.1 Summary of our results

Our main results in different settings are summarized in Table 1. We consider two-layer neural networks $f(x)=\langle v,\sigma(Wx)\rangle$ (where $\sigma$ is the ReLU activation) and rank-$k$ polynomials (see Example 4.3).

Table 1: Baselines and our main results for the sample complexity to find an $\epsilon$-optimal policy. The first two columns concern rank-$k$ polynomials; the last three concern neural networks of width $k$.

| | Sim. + Det. (R) | Onl. + Det. (R) | Sim. + Det. (R) | Sim. + Gap. (R) | Sim. + Stoch. (C) |
|---|---|---|---|---|---|
| Baseline | $O(d^{p})$ | $O(d^{p})$ | $O(d^{\mathrm{poly}(1/\epsilon)})$ (*) | $O(d^{\mathrm{poly}(1/\epsilon)})$ | $O(d^{\mathrm{poly}(1/\epsilon)})$ |
| Our results | $O(dk)$ | $O(dk)$ | $\tilde{O}(\mathrm{poly}(d)\cdot\exp(k))$ | $\tilde{O}(\mathrm{poly}(d,k))$ | $\tilde{O}(\mathrm{poly}(d,k)/\epsilon^{2})$ |

We make the following elaborations on Table 1.

  • For simplicity, we display only the dependence on the feature dimension $d$, network width or polynomial rank $k$, precision $\epsilon$, and degree $p$ (of polynomials).

  • In the table Sim. denotes simulator model, Onl. denotes online RL, Det. denotes deterministic transitions, Stoch. denotes stochastic transitions, Gap. denotes gap condition, (R) denotes realizability assumption only, and (C) denotes completeness assumption (either policy complete or Bellman complete) together with realizability assumption.

  • We apply [DLMW20] for the deterministic transition baseline and [DKWY20b] for the stochastic transition baseline. We are unaware of any methods that can directly learn MDPs with neural network value function approximation. (Prior work on neural function approximation has focused on neural tangent kernels, which would require $d^{\mathrm{poly}(1/\epsilon)}$ samples to approximate a two-layer network [GMMM21].)

  • In the polynomial case, the baseline first vectorizes the tensor $\begin{pmatrix}1\\ x\end{pmatrix}^{\otimes p}$ into a vector in $\mathbb{R}^{(d+1)^{p}}$ and then runs the linear method on this vector. In the neural network case, the baseline uses a polynomial of degree $1/\epsilon$ to approximate the neural network to precision $\epsilon$ and then vectorizes the polynomial into a vector in $\mathbb{R}^{d^{\mathrm{poly}(1/\epsilon)}}$. The baseline method for the realizable model (denoted by (*)) needs a further gap assumption of $\mathrm{gap}\geq d^{\mathrm{poly}(1/\epsilon)}\epsilon$ to keep the approximation error from escalating [DLMW20]; note that for small $\epsilon$ this condition never holds, but we include it in the table for the sake of comparison.

  • In the rank-$k$ polynomial case, our $O(dk)$ result for the simulator model can be found in Theorem 4.7 and our $O(dk)$ result for the online RL model can be found in Theorem 4.8. These results only require a realizability assumption; efficient exploration is guaranteed by algebraic-geometric arguments. In the neural network model, our $\tilde{O}(\mathrm{poly}(d)\cdot\exp(k))$ result for the simulator model can be found in Theorem 3.4; this result also relies only on the realizability assumption. For stochastic transitions, our $\tilde{O}(\mathrm{poly}(d,k)/\epsilon^{2})$ result works in either the policy complete or the Bellman complete setting, as in Theorem 3.5 and Theorem 3.6 respectively. The $\tilde{O}(\mathrm{poly}(d,k))$ result under the gap condition can be found in Theorem 3.8.

1.2 Related Work

Linear Function Approximation.

RL with linear function approximation has been widely studied under various settings, including linear MDP and linear mixture MDP [JYWJ20b, ZLKB20, YW20]. While these papers have proved efficient regret and sample complexity bounds, their analyses relied heavily on two techniques: they used the confidence ellipsoid to quantify the uncertainty, and they used the elliptical potential lemma to bound the total uncertainty [AYPS11]. These two techniques were integral to their analyses but are so restrictive that they generally do not extend to nonlinear cases.

Eluder Dimension.

[RVR13a, ORVR13] proposed the eluder dimension, a complexity measure of the function space, and proved regret and sample complexity bounds that scale with the eluder dimension for bandits and reinforcement learning [WSY20a, JLM21]. They also showed that the eluder dimension is small in several settings, including generalized linear models and LQR. However, as shown in [DYM21], the eluder dimension can be exponentially large even for a single ReLU neuron, which suggests that the eluder dimension faces difficulty in dealing with neural networks. The eluder dimension is only known to give non-trivial bounds for linear function classes and monotone functions of linear function classes. For structured polynomial classes, the only known eluder dimension bound comes from embedding the class into an ambient linear space of dimension $d^{p}$, where $d$ is the dimension and $p$ is the degree. This parallels the lower bounds in the linearization / neural tangent kernel (NTK) literature [WLLM19, GMMM19, AZL19], which show that linearization also incurs a similarly large penalty of $d^{p}$ sample complexity, and that more advanced algorithm design is needed to circumvent linearization [BL20, CBL+20, FLYZ20, WGL+19, GCL+19, NGL+19, GLM18b, MGW+20, HWLM20, WWL+20, DML21].

Bellman Rank and Completeness.

[JKA+17, SJK+19] studied RL with general function approximation. They used the Bellman rank to measure the error of the function class under the Bellman operator and proved bounds in terms of it. Recently, [DKL+21a] proposed the bilinear rank, which encompasses more function approximation models. However, it is hard to bound either the Bellman rank or the bilinear rank for neural networks, so these results are not known to cover the neural network approximation setting. Another line of work shows that exponential sample complexity is unavoidable even with good representations [DKWY20b, WAS20, WWK21], which implies that the realizability assumption alone might be insufficient for function approximation.

Deterministic RL.

Deterministic systems are often the starting point in the study of sample-efficient algorithms, since with both the transition kernel and the reward function deterministic, the exploration-exploitation trade-off is revealed more clearly. The seminal work [WVR13] proposes a sample-efficient Q-learning algorithm that works for a family of function classes. Recently, [DLMW20] studied the agnostic setting where the optimal Q-function can only be approximated by a function class up to some approximation error. The algorithm in [DLMW20] learns the optimal policy using a number of trajectories linear in the eluder dimension.

2 Preliminaries

An episodic Markov Decision Process (MDP) is defined by the tuple $\mathcal{M}=(\mathcal{S},\mathcal{A},H,\mathbb{P},r)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action set, $H$ is the number of time steps in each episode, $\mathbb{P}$ is the transition kernel, and $r$ is the reward function. In each episode the agent starts at a fixed initial state $s_{1}$, and at each time step $h\in[H]$ it takes action $a_{h}$, receives reward $r_{h}(s_{h},a_{h})$, and transitions to $s_{h+1}\sim\mathbb{P}(\cdot|s_{h},a_{h})$.

A deterministic policy $\pi$ is a length-$H$ sequence of functions $\pi=\{\pi_{h}:\mathcal{S}\mapsto\mathcal{A}\}_{h=1}^{H}$. Given a policy $\pi$, we define the value function $V_{h}^{\pi}(s)$ as the expected sum of rewards under policy $\pi$ starting from $s_{h}=s$:

\[ V_{h}^{\pi}(s):=\mathbb{E}\left[\sum_{t=h}^{H}r_{t}(s_{t},a_{t})\,\Big|\,s_{h}=s\right] \]

and we define the Q-function $Q_{h}^{\pi}(s,a)$ as the expected sum of rewards after taking action $a$ in state $s_{h}=s$ and then following $\pi$:

\[ Q_{h}^{\pi}(s,a):=\mathbb{E}\left[\sum_{t=h}^{H}r_{t}(s_{t},a_{t})\,\Big|\,s_{h}=s,a_{h}=a\right]. \]

The Bellman operator ${\cal T}_{h}$ applied to a Q-function $Q_{h+1}$ is defined as follows:

\[ {\cal T}_{h}(Q_{h+1})(s,a):=r_{h}(s,a)+\mathbb{E}_{s^{\prime}\sim\mathbb{P}(\cdot|s,a)}\left[\max_{a^{\prime}}Q_{h+1}(s^{\prime},a^{\prime})\right]. \]

There exists an optimal policy $\pi^{\ast}$ that gives the optimal value function for all states, i.e., $V^{\pi^{\ast}}_{h}(s)=\sup_{\pi}V_{h}^{\pi}(s)$ for all $h\in[H]$ and $s\in\mathcal{S}$. For notational simplicity we abbreviate $V^{\pi^{\ast}}$ as $V^{\ast}$ and correspondingly $Q^{\pi^{\ast}}$ as $Q^{\ast}$. Therefore $Q^{\ast}$ satisfies the following Bellman optimality equations for all $s\in\mathcal{S}$, $a\in\mathcal{A}$, and $h\in[H]$:

\[ Q^{\ast}_{h}(s,a)={\cal T}_{h}(Q^{\ast}_{h+1})(s,a). \]
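To make the backward recursion concrete, here is a minimal NumPy sketch (not part of the paper's algorithms) that computes $Q^{\ast}$ and the greedy policy by dynamic programming in a small tabular MDP; the array layout is our own choice.

```python
import numpy as np

def bellman_backup(P, r, H):
    """Finite-horizon Bellman optimality backup for a small tabular MDP.

    P[h] has shape (S, A, S): transition probabilities at step h.
    r[h] has shape (S, A): rewards at step h.
    Returns Q[0..H] with Q[h] of shape (S, A) satisfying Q_h = T_h(Q_{h+1}).
    """
    S, A, _ = P[0].shape
    Q = [np.zeros((S, A)) for _ in range(H + 1)]       # Q[H] = 0: no reward after the horizon
    for h in reversed(range(H)):
        V_next = Q[h + 1].max(axis=1)                  # V_{h+1}(s') = max_a Q_{h+1}(s', a)
        Q[h] = r[h] + P[h] @ V_next                    # Q_h(s,a) = r_h(s,a) + E_{s'}[V_{h+1}(s')]
    greedy_policy = [Q[h].argmax(axis=1) for h in range(H)]
    return Q, greedy_policy
```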

The goal is to find a policy $\pi$ that is $\epsilon$-optimal, in the sense that $V_{1}^{\ast}(s_{1})-V_{1}^{\pi}(s_{1})\leq\epsilon$, within a small number of samples. We consider two query models of interacting with the MDP:

  • In the simulator model ([Kak03], [SWW+18]), the agent interacts with a black box that simulates the MDP. At each time step $h\in[H]$, the agent can start at a state-action pair $(s,a)$ and interact with the black box by executing some policy $\pi$ chosen by the agent.

  • In online RL, the agent can only start at the initial state and interact with the MDP by executing a policy and observing the rewards and the next states. In each episode $k$, the agent proposes a policy $\pi^{k}$ based on all history up to episode $k-1$ and executes $\pi^{k}$ to generate a single trajectory $\{s_{h}^{k},a_{h}^{k}\}_{h=1}^{H}$ with $a_{h}^{k}=\pi^{k}_{h}(s_{h}^{k})$ and $s_{h+1}^{k}\sim\mathbb{P}(\cdot|s_{h}^{k},a_{h}^{k})$.

2.1 Function approximation

In reinforcement learning with value function approximation, the learner is given a function class ${\cal F}={\cal F}_{1}\times\cdots\times{\cal F}_{H}$, where ${\cal F}_{h}\subset\{f:\mathcal{S}\times\mathcal{A}\mapsto[0,1]\}$ is a set of candidate functions to approximate $Q^{\ast}$. The following assumption is commonly adopted in the literature [JYWJ20a, WSY20b, JLM21, DKL+21b].

Assumption 2.1 (Realizability).

$Q^{\ast}_{h}\in{\cal F}_{h}$ for all $h\in[H]$.

The function approximation is equipped with a feature mapping $\phi:\mathcal{S}\times\mathcal{A}\mapsto\{u\in\mathbb{R}^{d}:\|u\|_{2}\leq B_{\phi}\}$ that is known to the agent. We focus on the continuous-action setting (e.g., control and robotics problems) and make the following regularity assumption on the feature function $\phi$.

Assumption 2.2 (Bounded features).

Assume $\|\phi(s,a)\|_{2}\leq B_{\phi}$ for all $(s,a)\in{\cal S}\times{\cal A}$.

Notation

For any vector $x\in\mathbb{R}^{d}$, let $x_{\max}:=\max_{i\in[d]}x_{i}$ and $x_{\min}:=\min_{i\in[d]}x_{i}$. Let $s_{i}(\cdot)$ denote the $i$-th singular value, and let $s_{\min}(\cdot)$ and $s_{\max}(\cdot)$ denote the minimum and maximum singular values, respectively. The condition number is defined by $\kappa(\cdot):=s_{\max}(\cdot)/s_{\min}(\cdot)$. We use $\otimes$ to denote the Kronecker product and $\circ$ to denote the Hadamard product. For a given integer $H$, we use $[H]$ to denote the set $\{1,2,\ldots,H\}$. For a function $f:\mathfrak{X}\mapsto\mathfrak{Y}$, we use $f^{-1}(y):=\{x\in\mathfrak{X}:f(x)=y\}$ to denote the preimage of $y\in\mathfrak{Y}$. We use the shorthand $x\lesssim y$ ($x\gtrsim y$) to indicate $x\leq O(y)$ ($x\geq\Omega(y)$).

3 Neural Network Function Approximation

In this section we show sample-efficient algorithms with neural network function approximation. The function class of interest is given in the following definition; a more general neural network class is discussed in Appendix A.5.

Definition 3.1 (Neural Network Function Class).

We use ${\cal F}_{NN}$ to denote the class of functions $f(\phi(s,a)):\mathcal{S}\times\mathcal{A}\mapsto\mathbb{R}$, where $f(x)=\langle v,\sigma(Wx)\rangle:\mathbb{R}^{d}\mapsto\mathbb{R}$ is a two-layer neural network with ReLU activation $\sigma$, $\|W\|_{F}\leq B_{W}$, $v\in\{\pm 1\}^{k}$, $\prod_{i=1}^{k}s_{i}(W)/s_{\min}(W)\leq\lambda$, $s_{\max}(W)/s_{\min}(W)\leq\kappa$, and $k\leq d$. Here $\phi:\mathcal{S}\times\mathcal{A}\mapsto\mathbb{R}^{d}$ is a known feature map whose image contains a ball $\{u\in\mathbb{R}^{d}:\|u\|_{2}\leq\delta_{\phi}\}$ with $\delta_{\phi}\geq d\cdot\mathrm{polylog}(d)$. (This choice of $\delta_{\phi}$ is made only for simplicity; in general the assumption can be relaxed to the image of $\phi$ containing an arbitrary dense ball near the origin, since one can always rescale the feature mapping in the neural function approximation.)

We introduce the following completeness properties in the setting of value function approximation. Along with Assumption 2.1, they are commonly adopted in the literature.

Definition 3.2 (Policy complete).

Given an MDP $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathbb{P},r,H)$, a function class ${\cal F}_{h}:\mathcal{S}\times\mathcal{A}\mapsto\mathbb{R}$, $h\in[H]$, is called policy complete if for all $\pi$ and $h\in[H]$, $Q^{\pi}_{h}\in{\cal F}_{h}$.

Definition 3.3 (Bellman complete).

Given an MDP $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathbb{P},r,H)$, a function class ${\cal F}_{h}:\mathcal{S}\times\mathcal{A}\mapsto\mathbb{R}$, $h\in[H]$, is called Bellman complete if for all $h\in[H]$ and $Q_{h+1}\in{\cal F}_{h+1}$, ${\cal T}_{h}(Q_{h+1})\in{\cal F}_{h}$.

3.1 Warmup: Realizable $Q^{\ast}$ with deterministic transition

We start by considering the case where the transition kernel is deterministic. In this case only Assumption 2.1 is required for the expressiveness of the neural network function approximation. Algorithm 1 learns the optimal policy from time step $H$ down to $1$. Suppose that at level $h$ we have learned policies $\pi_{h+1},\dots,\pi_{H}$ and that they are exactly the optimal policies. We first explore features $\phi(s_{h}^{i},a_{h}^{i})$ drawn from a standard Gaussian distribution, and if $\|\phi(s_{h}^{i},a_{h}^{i})\|_{2}\geq\delta_{\phi}$ we simply skip that trial. Recall that $\delta_{\phi}\geq d\cdot\mathrm{polylog}(d)$, so with high probability (with respect to $d$) almost all feature samples will be explored. We next construct an estimate $\widehat{Q}_{h}^{i}$ of $Q^{\ast}(s^{i}_{h},a^{i}_{h})$ by collecting cumulative rewards using $\pi_{h+1},\dots,\pi_{H}$ as the roll-out. Since the transition is deterministic, $\widehat{Q}_{h}^{i}=Q^{\ast}(s^{i}_{h},a^{i}_{h})$ for all explored samples $(s^{i}_{h},a^{i}_{h})$. Since $Q^{\ast}_{h}$ is a two-layer neural network, we can now recover its parameters in Line 12 exactly by invoking techniques from neural network optimization (see, e.g., [JSA15, GLM17, ZSJ+17]). Details of this step can be found in Appendix A.5, where the method is mainly based on [ZSJ+17]. This means the reconstructed $\widehat{Q}_{h}$ in Line 13 is precisely $Q^{\ast}$, and the algorithm can thus find the optimal policy $\pi^{\ast}_{h}$ at the $h$-th level.

Algorithm 1 Learning realizable $Q^{\ast}$ with deterministic transition
1:for $h=H,\ldots,1$ do
2:     Sample $x^{i}_{h}$, $i\in[n]$, from the standard Gaussian $\mathcal{N}(0,I_{d})$
3:     for $i\in[n]$ do
4:         if $\|x^{i}_{h}\|\leq\delta_{\phi}$ then
5:              Find $(s^{i}_{h},a^{i}_{h})\in\phi^{-1}(x^{i}_{h})$ and locate the state $s^{i}_{h}$ in the generative model
6:              Pull action $a^{i}_{h}$ and use $\pi_{h+1},\dots,\pi_{H}$ as the roll-out to collect rewards $r_{h}^{(i)},\dots,r_{H}^{(i)}$
7:              Construct the estimate $\widehat{Q}_{h}^{i}\leftarrow r_{h}^{(i)}+\cdots+r_{H}^{(i)}$
8:         else
9:              Let $\widehat{Q}_{h}^{i}\leftarrow 0$
10:         end if
11:     end for
12:     Compute $(v_{h},W_{h})\leftarrow\textsc{NeuralNetRecovery}(\{(x^{i}_{h},\widehat{Q}_{h}^{i}):i\in[n]\})$
13:     Set $\widehat{Q}_{h}(s,a)\leftarrow v_{h}^{\top}\sigma(W_{h}\phi(s,a))$
14:     Let $\pi_{h}(s)\leftarrow\operatorname*{arg\,max}_{a\in\mathcal{A}}\widehat{Q}_{h}(s,a)$
15:end for
16:Return $\pi_{1},\dots,\pi_{H}$
Theorem 3.4.

(Informal) If $n\geq d\cdot\mathrm{poly}(\kappa,k,\lambda,\log d,B_{W},B_{\phi},H)$, then with high probability Algorithm 1 learns the optimal policy.

The formal statement and complete proof are deferred to Appendix A.1. The main idea of exact neural network recovery is as follows. We first use the method of moments to obtain a 'rough' parameter estimate. When this rough estimate is sufficiently close to the true parameters, the empirical $\ell_{2}$ loss is locally strongly convex and has a unique global minimum; gradient descent then finds this global minimum, which is exactly the true parameter.
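As a rough illustration of the second stage, the following sketch refines a warm start by gradient descent on the empirical squared loss for $f(x)=\langle v,\sigma(Wx)\rangle$ with $v\in\{\pm 1\}^{k}$ held fixed (Definition 3.1). It is only schematic: the actual NeuralNetRecovery subroutine follows [ZSJ+17], and the warm start W0 is assumed to come from a method-of-moments step not shown here.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def refine_by_gradient_descent(X, y, v, W0, lr=1e-2, steps=5000):
    """Refine a warm start W0 for f(x) = <v, relu(W x)> by gradient descent on
    the empirical squared loss; v in {+1,-1}^k is held fixed (Definition 3.1).
    X: (n, d) features, y: (n,) targets, W0: (k, d) method-of-moments warm start."""
    W = W0.copy()
    n = X.shape[0]
    for _ in range(steps):
        pre = X @ W.T                      # (n, k) pre-activations
        pred = relu(pre) @ v               # (n,) network outputs <v, relu(W x_i)>
        resid = pred - y
        # gradient of (1/2n) sum_i (pred_i - y_i)^2 with respect to W
        grad = ((resid[:, None] * (pre > 0)) * v[None, :]).T @ X / n
        W -= lr * grad
    return W
```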

3.2 Policy complete neural function approximation

Now we consider general stochastic transitions. Difficulties arise in this scenario due to noise in the estimation of the Q-functions: in the presence of model misspecification, this noise can cause estimation errors to amplify across levels and require a number of samples exponential in $H$. In this section, we show that neural network function approximation is still learnable, assuming the function class ${\cal F}_{NN}$ is policy complete with regard to the MDP $\mathcal{M}$. Thus for all $\pi\in\Pi$, we can write $Q_{h}^{\pi}(s,a)=\langle v^{\pi},\sigma(W^{\pi}\phi(s,a))\rangle$.

Algorithm 2 Learn policy complete NN with simulator.
1:for $h=H,\ldots,1$ do
2:     Sample $x^{i}_{h}$, $i\in[n]$, from the standard Gaussian $\mathcal{N}(0,I_{d})$
3:     for $i\in[n]$ do
4:         if $\|x^{i}_{h}\|\leq\delta_{\phi}$ then
5:              Find $(s^{i}_{h},a^{i}_{h})\in\phi^{-1}(x^{i}_{h})$ and locate the state $s^{i}_{h}$ in the generative model
6:              Pull action $a^{i}_{h}$ and use $\pi_{h+1},\dots,\pi_{H}$ as the roll-out to collect rewards $r_{h}^{(i)},\dots,r_{H}^{(i)}$
7:              Construct an unbiased estimate of $Q_{h}^{\pi_{h+1},\dots,\pi_{H}}(s^{i}_{h},a^{i}_{h})$: $\widehat{Q}_{h}^{i}\leftarrow r_{h}^{(i)}+\cdots+r_{H}^{(i)}$
8:         else
9:              Let $\widehat{Q}_{h}^{i}\leftarrow 0$
10:         end if
11:     end for
12:     Compute $(v_{h},W_{h})\leftarrow\textsc{NeuralNetNoisyRecovery}(\{(x^{i}_{h},\widehat{Q}_{h}^{i}):i\in[n]\})$
13:     Set $\widehat{Q}_{h}(s,a)\leftarrow v_{h}^{\top}\sigma(W_{h}\phi(s,a))$
14:     Let $\pi_{h}(s)\leftarrow\operatorname*{arg\,max}_{a\in\mathcal{A}}\widehat{Q}_{h}(s,a)$
15:end for
16:Return $\pi_{1},\dots,\pi_{H}$

Algorithm 2 learns policies from level $H,H-1,\dots,1$. At level $h$, the algorithm has learned policies $\pi_{h+1},\dots,\pi_{H}$ that are sub-optimal by at most $(H-h)\epsilon/H$. It then explores features $\phi(s,a)$ drawn from $\mathcal{N}(0,I_{d})$, queries $(s,a)$, and uses the learned policies $\pi_{h+1},\dots,\pi_{H}$ as the roll-out to collect an unbiased estimate of the Q-function $Q_{h}^{\pi_{h+1},\dots,\pi_{H}}(s,a)$. Since $Q_{h}^{\pi_{h+1},\dots,\pi_{H}}(s,a)\in{\cal F}_{NN}$ is a two-layer neural network, it can then be recovered from samples. Details of this step can be found in Appendix A.5, where the methods are mainly based on [ZSJ+17]. The algorithm then reconstructs this Q-function and takes its greedy policy $\pi_{h}$.
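The roll-out step (Lines 6-7) is plain Monte Carlo evaluation; a minimal sketch follows, where the generative-model interface (set_state, step) and the policy representation are hypothetical stand-ins rather than anything specified in the paper.

```python
def rollout_estimate(env, s, a, policies, h, H):
    """Unbiased Monte Carlo estimate of Q_h^{pi_{h+1},...,pi_H}(s, a) as in
    Algorithm 2. env.set_state(s, h) places the (hypothetical) simulator at
    state s and step h; env.step(a) returns (reward, next_state).
    policies[t] maps a state to an action for t = h+1, ..., H."""
    env.set_state(s, h)
    total = 0.0
    reward, state = env.step(a)                        # take the queried action at step h
    total += reward
    for t in range(h + 1, H + 1):
        reward, state = env.step(policies[t](state))   # follow the learned roll-out policy
        total += reward
    return total
```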

Theorem 3.5.

(Informal) Fix $\epsilon,t$. If $n\geq\epsilon^{-2}\cdot d\cdot\mathrm{poly}(\kappa,k,B_{W},B_{\phi},H,\log(d/t))$, then with probability at least $1-t$ Algorithm 2 returns an $\epsilon$-optimal policy $\pi$.

The formal statement and complete proof are deferred to Appendix A.2. Notice that unlike Theorem 3.4, the sample complexity does not depend on $\lambda$, thus avoiding a potential exponential dependence on $k$.

The main idea of the proof is that at each time step, a neural network surrogate of $Q^{\ast}$ can be constructed from the policies already learned. Suppose we have learned $\pi_{h+1},\dots,\pi_{H}$ at level $h$. By policy completeness, $Q^{\pi_{h+1},\dots,\pi_{H}}_{h}$ belongs to ${\cal F}_{NN}$, and we can interact with the simulator to obtain an estimate $\widehat{Q}_{h}$ of it. If $\|\widehat{Q}_{h}-Q^{\pi_{h+1},\dots,\pi_{H}}_{h}\|_{\infty}$ is small, the greedy policy based on $\widehat{Q}_{h}$ is not far from the optimal policy of $Q^{\pi_{h+1},\dots,\pi_{H}}_{h}$. Therefore the error decouples into the error of recovering $Q^{\pi_{h+1},\dots,\pi_{H}}_{h}$ and the suboptimality of $\pi_{h+1},\dots,\pi_{H}$, which depends on level $h+1$. This reasoning can be performed recursively up to level $H$, and thus we can bound the suboptimality of $\pi_{h}$.
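One step of this recursion can be written as the following inequality (our own rearrangement of the argument, not a display taken from the paper): since $\pi_{h}$ is greedy with respect to $\widehat{Q}_{h}$,

\[ V^{\ast}_{h}(s)-V^{\pi_{h},\dots,\pi_{H}}_{h}(s)\;\leq\;2\,\big\|\widehat{Q}_{h}-Q^{\pi_{h+1},\dots,\pi_{H}}_{h}\big\|_{\infty}\;+\;\big\|V^{\ast}_{h+1}-V^{\pi_{h+1},\dots,\pi_{H}}_{h+1}\big\|_{\infty}, \]

so if each recovery error is at most $\epsilon/(2H)$, unrolling from level $h$ to $H$ yields an $\epsilon$-optimal policy.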

3.3 Bellman complete neural function approximation

In addition to policy completeness, we show that neural network function approximation can also be learned efficiently in the setting where the function class ${\cal F}_{NN}$ is Bellman complete with regard to the MDP $\mathcal{M}$. Specifically, for every $Q_{h+1}\in{\cal F}_{h+1}$ there exist $v^{Q_{h+1}}$ and $W^{Q_{h+1}}$ such that ${\cal T}_{h}(Q_{h+1})(s,a)=\langle v^{Q_{h+1}},\sigma(W^{Q_{h+1}}\phi(s,a))\rangle$.

Algorithm 3 is similar to the algorithm in the previous section. Suppose that at level $h$ the algorithm has constructed a Q-function $\widehat{Q}_{h+1}(s,a)=v_{h+1}^{\top}\sigma(W_{h+1}\phi(s,a))$ that is $(H-h)\epsilon/H$-close to the optimal $Q^{\ast}_{h+1}$. It then recovers weights $v_{h},W_{h}$ of ${\cal T}_{h}(\widehat{Q}_{h+1})(s,a)=\langle v^{\widehat{Q}_{h+1}},\sigma(W^{\widehat{Q}_{h+1}}\phi(s,a))\rangle$, using the unbiased estimates $r_{h}(s^{i}_{h},a^{i}_{h})+\widehat{V}_{h+1}(s_{h+1}^{i})$. The Q-function $\widehat{Q}_{h}(s,a)=v_{h}^{\top}\sigma(W_{h}\phi(s,a))$ reconstructed from the weights $v_{h},W_{h}$ is then $(H-h+1)\epsilon/H$-close to $Q^{\ast}_{h}$.

Algorithm 3 Learn Bellman complete NN with simulator.
1:for $h=H,\ldots,1$ do
2:     Sample $x^{i}_{h}$, $i\in[n]$, from the standard Gaussian $\mathcal{N}(0,I_{d})$
3:     for $i\in[n]$ do
4:         if $\|x^{i}_{h}\|\leq\delta_{\phi}$ then
5:              Find $(s^{i}_{h},a^{i}_{h})\in\phi^{-1}(x^{i}_{h})$ and locate the state $s^{i}_{h}$ in the generative model
6:              Pull action $a^{i}_{h}$ and observe $r_{h}(s^{i}_{h},a^{i}_{h})$ and $s_{h+1}^{i}$
7:              Construct an unbiased estimate of ${\cal T}_{h}(\widehat{Q}_{h+1})(s^{i}_{h},a^{i}_{h})$: $\widehat{Q}_{h}^{i}\leftarrow r_{h}(s^{i}_{h},a^{i}_{h})+\widehat{V}_{h+1}(s_{h+1}^{i})$
8:         else
9:              Let $\widehat{Q}_{h}^{i}\leftarrow 0$
10:         end if
11:     end for
12:     Compute $(v_{h},W_{h})\leftarrow\textsc{NeuralNetNoisyRecovery}(\{(x^{i}_{h},\widehat{Q}_{h}^{i}):i\in[n]\})$
13:     Set $\widehat{Q}_{h}(s,a)\leftarrow v_{h}^{\top}\sigma(W_{h}\phi(s,a))$ and $\widehat{V}_{h}(s)\leftarrow\max_{a\in\mathcal{A}}\widehat{Q}_{h}(s,a)$
14:     Let $\pi_{h}(s)\leftarrow\operatorname*{arg\,max}_{a\in\mathcal{A}}\widehat{Q}_{h}(s,a)$
15:end for
16:Return $\pi_{1},\dots,\pi_{H}$
Theorem 3.6.

(Informal) Fix $\epsilon,t$. If $n\geq\epsilon^{-2}\cdot d\cdot\mathrm{poly}(\kappa,k,B_{W},B_{\phi},H,\log(d/t))$, then with probability at least $1-t$ Algorithm 3 returns an $\epsilon$-optimal policy $\pi$.

Due to Bellman completeness, the estimation error of $\widehat{Q}_{h}$ can be controlled recursively; in fact, we can show by induction that $\|\widehat{Q}_{h}-Q^{\ast}_{h}\|_{\infty}$ is small. The formal statement and detailed proof are deferred to Appendix A.3. As in Theorem 3.5, the sample complexity does not explicitly depend on $\lambda$, thus avoiding a potentially exponential dependence on $k$.
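In our own notation, one step of this induction uses the fact that the Bellman operator is $1$-Lipschitz in the sup norm: if the noisy recovery guarantees $\|\widehat{Q}_{h}-{\cal T}_{h}(\widehat{Q}_{h+1})\|_{\infty}\leq\epsilon/H$, then

\[ \|\widehat{Q}_{h}-Q^{\ast}_{h}\|_{\infty}\;\leq\;\|\widehat{Q}_{h}-{\cal T}_{h}(\widehat{Q}_{h+1})\|_{\infty}+\|{\cal T}_{h}(\widehat{Q}_{h+1})-{\cal T}_{h}(Q^{\ast}_{h+1})\|_{\infty}\;\leq\;\frac{\epsilon}{H}+\|\widehat{Q}_{h+1}-Q^{\ast}_{h+1}\|_{\infty}, \]

so unrolling over $h$ gives $\|\widehat{Q}_{h}-Q^{\ast}_{h}\|_{\infty}\leq(H-h+1)\epsilon/H$.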

3.4 Realizable $Q^{\ast}$ with optimality gap

In this section we consider MDPs in which there is a non-zero gap between the value of the optimal action and that of any other action. This quantity, known as the optimality gap, is widely used in the reinforcement learning and bandit literature [DKJ+19, DKWY20b, DLMW20].

Definition 3.7.

The optimality gap is defined as

\[ \rho=\inf_{a:\,Q^{\ast}(s,a)\neq V^{\ast}(s)}\;V^{\ast}(s)-Q^{\ast}(s,a). \]

We show that in the presence of a positive optimality gap, there exists an algorithm that learns the optimal policy with polynomially many samples even without the completeness assumptions. Intuitively, this is because one only needs to recover the neural network up to precision $\rho/4$ in order to ensure that the greedy policy is identical to the optimal one. The formal statement and proof are deferred to Appendix A.4.
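The standard argument behind this intuition, written in our own words: if $\|\widehat{Q}_{h}-Q^{\ast}_{h}\|_{\infty}\leq\rho/4$, then for any state $s$ and any action $a$ with $Q^{\ast}_{h}(s,a)\neq V^{\ast}_{h}(s)$ (hence $Q^{\ast}_{h}(s,a)\leq V^{\ast}_{h}(s)-\rho$),

\[ \widehat{Q}_{h}(s,a)\;\leq\;Q^{\ast}_{h}(s,a)+\frac{\rho}{4}\;\leq\;V^{\ast}_{h}(s)-\frac{3\rho}{4}\;<\;V^{\ast}_{h}(s)-\frac{\rho}{4}\;\leq\;\widehat{Q}_{h}(s,\pi^{\ast}_{h}(s)), \]

so the greedy action under $\widehat{Q}_{h}$ coincides with the optimal action.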

Theorem 3.8.

(Informal) Fix $t\in(0,1)$. If $n=\frac{d}{\rho^{2}}\cdot\mathrm{poly}(\kappa,k,B_{W},B_{\phi},H,\log(d/t))$, then with probability at least $1-t$ there exists an algorithm that returns the optimal policy $\pi^{\ast}$.

Remark 3.9.

In all of the aforementioned methods, there are two key components that allow efficient learning. First, the exploration is conducted in a way that guarantees $\ell_{\infty}$ recovery of candidate functions: by $\ell_{\infty}$ recovery we mean that the algorithm recovers a candidate Q-function in the class that deviates from the target function $Q^{\ast}$ by at most $\epsilon$ uniformly over all state-action pairs in the domain of interest. This notion of learning guarantee has been studied in active learning [Han14, KAH+17] and has recently gained interest in contextual bandits [FAD+18]. Second, the agent constructs unbiased estimators of certain approximations of $Q^{\ast}$ that lie in the neural function approximation class. This allows the recovery error to decouple linearly across time steps, which is possible in several well-posed MDP instances, such as deterministic MDPs, MDPs with completeness assumptions, and MDPs with gap conditions. In principle, provably efficient RL with general function approximation is possible as long as the above two components are present. We will see in the next section another example of RL with highly non-convex function approximation, where the function class of interest, admissible polynomial families, also admits exploration schemes that achieve $\ell_{\infty}$ recovery.

4 Polynomial Realizability

In this section, we study the sample complexity of learning deterministic MDPs under polynomial realizability. We identify sufficient and necessary conditions for efficiently learning the MDPs in two different settings: the generative model setting and the online RL setting. Specifically, we show that if the image of the feature map $\phi_{h}(s_{h},a_{h})$ satisfies certain positive-measure conditions, then by exploring randomly we can identify the optimal policy with a number of samples linear in the algebraic dimension of the underlying polynomial class. We also provide a lower bound example showing the separation between the two settings.

Next, we introduce the notion of Admissible Polynomial Families, which are the families of structured polynomials that enable efficient learning.

Definition 4.1 (Admissible Polynomial Families).

For $x\in\mathbb{R}^{d}$, denote $\widetilde{x}=[1,x^{\top}]^{\top}$. Let $\mathcal{X}:=\{\widetilde{x}^{\otimes p}:x\in\mathbb{R}^{d}\}$. For any algebraic variety ${\mathcal{V}}$, we define ${\cal F}_{{\mathcal{V}}}:=\{f_{\Theta}(x)=\langle\Theta,\widetilde{x}^{\otimes p}\rangle:\Theta\in\mathcal{V}\}$ as the polynomial family parameterized by $\Theta\in{\mathcal{V}}$. We say ${\cal F}_{{\mathcal{V}}}$ is admissible if for any $\Theta\in{\mathcal{V}}$, $\dim(\{X\in\mathcal{X}:\langle X,\Theta\rangle=0\})<\dim(\mathcal{X})=d$; that is, the dimension of $\mathcal{X}$ decreases by one under any additional linear constraint $\langle\Theta,X\rangle=0$ on $\mathcal{X}$. We define the dimension $D$ of the family to be the dimension of ${\mathcal{V}}$.

The following theorem shows that to learn an admissible polynomial family, the sample complexity only scales with the algebraic dimension of the polynomial family.

Theorem 4.2 ([HHK+21]).

Consider the polynomial family ${\cal F}_{{\mathcal{V}}}$ of dimension $D$. For $n\geq 2D$, there exists a Lebesgue-measure-zero set $N\subset\mathbb{R}^{d}\times\cdots\times\mathbb{R}^{d}$ such that if $(x_{1},\cdots,x_{n})\notin N$, then for any $y_{i}$ there is a unique $f\in{\cal F}_{{\mathcal{V}}}$ (or no such $f$) satisfying the system of equations $y_{i}=f(x_{i})$.

We give two important examples of admissible polynomial families with low dimension.

Example 4.3.

(Low-rank polynomial of rank $k$) The function $f\in\mathcal{F}_{{\mathcal{V}}}$ is a polynomial with $k$ terms, that is,

\[ F(x)=\sum_{i=1}^{k}\lambda_{i}\langle v_{i},x\rangle^{p_{i}}, \]

where $p=\max_{i}\{p_{i}\}$. The dimension of this family is upper bounded by $D\leq dk$. Neural networks with monomial/polynomial activation functions are low-rank polynomials.
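A toy sketch of Example 4.3, just to contrast the $O(dk)$ parameter count with the $(d+1)^{p}$ coefficients of the naive tensor embedding used by the baseline (the specific numbers below are arbitrary choices of ours):

```python
import numpy as np

def low_rank_poly(x, lams, V, powers):
    """Evaluate F(x) = sum_i lam_i * <v_i, x>^{p_i}  (Example 4.3).
    V has shape (k, d), so the parameter count is O(dk), versus the
    (d+1)^p coefficients of the vectorized tensor used by the baseline."""
    return float(np.sum(lams * (V @ x) ** powers))

# toy instance: d = 50, rank k = 3, degree p = 4
d, k, p = 50, 3, 4
rng = np.random.default_rng(0)
lams, V = rng.normal(size=k), rng.normal(size=(k, d))
powers = rng.integers(1, p + 1, size=k)
x = rng.normal(size=d)
print(low_rank_poly(x, lams, V, powers), "params:", d * k, "vs naive:", (d + 1) ** p)
```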

Example 4.4.

The function $f\in\mathcal{F}_{\mathcal{V}}$ is of the form $f(x)=q(Ux)$, where $U\in\mathbb{R}^{k\times d}$ and $q$ is a degree-$p$ polynomial. The polynomial $q$ and the matrix $U$ are unknown. The dimension of this family is upper bounded by $D\leq d(k+1)^{p}$.

Next, we introduce the notion of positive measure.

Definition 4.5.

We say a measurable set $E\subset\mathbb{R}^{d}$ is of positive measure if $\mu(E)>0$, where $\mu$ is the standard Lebesgue measure on $\mathbb{R}^{d}$.

If a measurable set $E$ satisfies $\mu(E)>0$, then there exists a procedure for drawing samples from $E$ such that, for any $N\subset\mathbb{R}^{d}$ of Lebesgue measure zero, the probability that a sample falls in $N$ is zero: the sampling distribution can be taken to be $\mathbb{P}_{x\sim\mathcal{N}(0,I_{d})}(\cdot\,|\,x\in E)$. The intuition behind this definition is that for all admissible polynomial families, the set of $(x_{1},\cdots,x_{n})$ carrying "redundant information" about the parameter $\Theta$ has Lebesgue measure zero; a positive-measure set therefore allows the learner to query randomly while avoiding such redundant measurements.
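Such a conditional sampler can be realized by simple rejection sampling; the sketch below assumes only a membership oracle for $E$ (the oracle and the cap on the number of tries are our own additions):

```python
import numpy as np

def sample_conditional_gaussian(membership, d, rng, max_tries=10000):
    """Draw x ~ N(0, I_d) conditioned on x in E, by rejection sampling.
    `membership(x)` is a hypothetical indicator of the positive-measure set E;
    since mu(E) > 0, the loop terminates with probability one."""
    for _ in range(max_tries):
        x = rng.normal(size=d)
        if membership(x):
            return x
    raise RuntimeError("E appears to have very small Gaussian mass")

# example: E = {x : x_1 > 0}, a positive-measure set
rng = np.random.default_rng(0)
x = sample_conditional_gaussian(lambda x: x[0] > 0, d=10, rng=rng)
```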

The next two theorems identify sufficient conditions for efficiently learning deterministic MDPs under polynomial realizability. Specifically, in the online RL setting we require the strong assumption that the set $\{\phi_{h}(s,a)\,|\,a\in{\mathcal{A}}\}$ is of positive measure for all $h\in[H]$ and all $s\in{\mathcal{S}}$, whereas in the generative model setting we only require the union $\bigcup_{s\in{\mathcal{S}}}\{\phi_{h}(s,a)\,|\,a\in{\mathcal{A}}\}$ to be of positive measure for all $h\in[H]$. The algorithms for both cases are summarized in Algorithms 4 and 5.

Assumption 4.6 (Polynomial Realizability).

For all $h\in[H]$, $Q^{\ast}_{h}(s_{h},a_{h})$, viewed as a function of $\phi_{h}(s_{h},a_{h})$, lies in some admissible polynomial family ${\cal F}_{{\mathcal{V}}_{h}}$ with dimension bounded by $D$.

Theorem 4.7.

For the generative model setting, assume that the set $\{\phi_{h}(s,a)\mid s\in{\mathcal{S}},a\in{\mathcal{A}}\}$ is of positive measure at every level $h$. Under polynomial realizability, Algorithm 4 almost surely learns the optimal policy $\pi^{\ast}$ with at most $N=2DH$ samples.

Theorem 4.8.

For the online RL setting, assume that $\{\phi_{h}(s,a)\mid a\in{\mathcal{A}}\}$ is of positive measure for every state $s$ at every level $h$. Under polynomial realizability, within $T=2DH$ episodes, Algorithm 5 learns the optimal policy $\pi^{\ast}$ almost surely.

Algorithm 4 Dynamic programming under the generative model setting
1:for $h=H,\cdots,1$ do
2:     Sample $2D$ points $\{\phi_{h}(s_{h}^{(i)},a_{h}^{(i)})\}_{i=1}^{2D}$ according to $\mathbb{P}_{x\sim\mathcal{N}(0,I_{d})}(\cdot\,|\,x\in E_{h})$, where $E_{h}=\{\phi_{h}(s,a)\mid s\in{\mathcal{S}},a\in{\mathcal{A}}\}$.
3:     Query the generative model with state-action pair $(s_{h}^{(i)},a_{h}^{(i)})$ at level $h$ for $i=1,\dots,2D$, and observe the next state $\widetilde{s}_{h}^{(i)}$ and reward $r_{h}^{(i)}$.
4:     Solve for $Q^{*}_{h}$ from the $2D$ equations $Q^{*}_{h}(s_{h}^{(i)},a_{h}^{(i)})=r_{h}^{(i)}+V^{*}_{h+1}(\widetilde{s}_{h}^{(i)})$.
5:     Set $\pi^{*}_{h}(s)=\operatorname*{arg\,max}_{a}Q^{*}_{h}(s,a)$ and $V_{h}^{*}(s)=\max_{a}Q^{*}_{h}(s,a)$.
6:end for
7:Output $\pi^{*}$
Algorithm 5 Dynamic programming under the online RL setting
1:for $h=H,\cdots,1$ do
2:     Fix any action sequence $a_{1},\cdots,a_{h-1}$.
3:     Play $a_{1},\cdots,a_{h-1}$ for the first $h-1$ levels and reach a state $s_{h}$. Sample $2D$ points $\{\phi_{h}(s_{h},a_{h}^{(i)})\}_{i=1}^{2D}$ according to $\mathbb{P}_{x\sim\mathcal{N}(0,I_{d})}(\cdot\,|\,x\in E_{h})$, where $E_{h}=\{\phi_{h}(s_{h},a)\mid a\in{\mathcal{A}}\}$.
4:     Play $a_{h}^{(i)}$ at $s_{h}$ for $i=1,\dots,2D$, and observe the next state $\widetilde{s}_{h}^{(i)}$ and reward $r_{h}^{(i)}$.
5:     Solve for $Q^{*}_{h}$ from the $2D$ equations $Q^{*}_{h}(s_{h},a_{h}^{(i)})=r_{h}^{(i)}+V^{*}_{h+1}(\widetilde{s}_{h}^{(i)})$.
6:     Set $\pi^{*}_{h}(s)=\operatorname*{arg\,max}_{a}Q^{*}_{h}(s,a)$ and $V_{h}^{*}(s)=\max_{a}Q^{*}_{h}(s,a)$.
7:end for
8:Output $\pi^{*}$
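A schematic of the dynamic-programming loop of Algorithm 4 in code. Everything here is a hypothetical stand-in: sim.query, sample_feature, phi, and solve_poly_system are placeholder interfaces (the last one abstracting the algebraic solver implied by Theorem 4.2), and the greedy step is taken over a finite set of candidate actions purely for simplicity.

```python
def dp_with_generative_model(sim, phi, sample_feature, solve_poly_system, actions, D, H):
    """Sketch of Algorithm 4. For each level h (backwards), draw 2D generic
    features with preimages (s, a), query the simulator, form the equations
    Q*_h(s, a) = r + V*_{h+1}(s'), and solve for Q*_h, which is unique for
    generic samples by Theorem 4.2."""
    V = {H + 1: lambda s: 0.0}                            # V*_{H+1} = 0
    Q, pi = {}, {}
    for h in range(H, 0, -1):
        equations = []
        for _ in range(2 * D):
            (s, a), x = sample_feature(h)                 # generic x = phi_h(s, a) and a preimage
            s_next, r = sim.query(s, a, h)
            equations.append((x, r + V[h + 1](s_next)))   # Q*_h(s, a) = r + V*_{h+1}(s')
        f_h = solve_poly_system(equations)                # callable on features: x -> Q*_h
        Q[h] = lambda s, a, f=f_h, h=h: f(phi(s, a, h))
        pi[h] = lambda s, Qh=Q[h]: max(actions, key=lambda a: Qh(s, a))
        V[h] = lambda s, Qh=Q[h]: max(Qh(s, a) for a in actions)
    return pi
```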

We remark that our Theorem 4.8 for learning MDPs under the online RL setting relies on a very strong assumption that allows the learner to explore randomly from any state. However, this assumption is necessary in some sense, as suggested by our lower bound example in the next subsection.

4.1 Necessity of Generic Feature Maps in Online RL

In this section, we consider lower bounds for learning deterministic MDPs with polynomially realizable $Q^{*}$ in the online RL setting. Our goal is to show that in the online setting the genericity assumption on the feature maps $\phi_{h}(s,\cdot)$ is necessary. In contrast, under the generative model setting one can efficiently learn the MDPs without such a strong assumption, since at every level $h$ we can set the state arbitrarily.

MDP construction

We briefly describe the intuition behind our construction. Consider a family of MDPs with only two states ${\mathcal{S}}=\{S_{\text{good}},S_{\text{bad}}\}$. We set the feature map $\phi_{h}$ such that, for the good state $S_{\text{good}}$, it allows the learner to explore randomly, i.e., $\{\phi_{h}(S_{\text{good}},a)\mid a\in{\mathcal{A}}\}$ is of positive measure.

However, for the bad state $S_{\text{bad}}$, all actions are mapped to some restricted set that forbids random exploration, i.e., $\{\phi_{h}(S_{\text{bad}},a)\mid a\in{\mathcal{A}}\}$ has measure zero. This is illustrated in Figure 1.

Specifically, at least $\Omega(d^{p})$ actions are needed to identify the ground-truth polynomial of $Q_{h}^{*}$ at $S_{\text{bad}}$, while $O(d)$ actions suffice at $S_{\text{good}}$.

The transition is constructed as $\mathbb{P}_{h}(S_{\text{bad}}|s,a)=1$ for all $s\in{\mathcal{S}}$, $a\in{\mathcal{A}}$, which means that in the online setting it is impossible to reach the good state at any level $h>1$.

Figure 1: An illustration of the hard case for deterministic MDPs with polynomially realizable $Q^{*}$. The image of the feature map $\phi_{h}$ at $S_{\text{good}}$ is of positive measure, while the image of $\phi_{h}$ at $S_{\text{bad}}$ is not. This makes it difficult to learn under the online RL setting.
Theorem 4.9.

There exists a family of MDPs satisfying Assumption 4.6 such that the set $\{\phi_{h}(s,a)\mid s\in{\mathcal{S}},a\in{\mathcal{A}}\}$ is of positive measure at every level $h$, but for every $h$ there is some $s_{\mathrm{bad}}\in{\mathcal{S}}$ such that $\{\phi_{h}(s_{\mathrm{bad}},a)\mid a\in{\mathcal{A}}\}$ has measure zero. Under the online RL setting, any algorithm needs to play at least $\Omega(d^{p})$ episodes to identify the optimal policy. In contrast, under the generative model setting, only $O(d)\cdot H$ samples are needed.

5 Conclusions

In this paper, we consider neural network and polynomial function approximation in the simulator and online settings. To our knowledge, this is the first paper to show that sample-efficient reinforcement learning is possible with neural network function approximation. Our results substantially improve upon what can be achieved with existing results that primarily rely on embedding neural networks into linear function classes. The analysis reveals that for function approximations that allow efficient $\ell_{\infty}$ recovery, such as two-layer neural networks and admissible polynomial families, reinforcement learning can be reduced to parameter recovery problems, which are well studied in the theory of deep learning, phase retrieval, and so on. Our method can also potentially be extended to handle three-layer and deeper neural networks, using the advanced tools in [FKR19, FVD19].

Our results for polynomial activations require deterministic transitions, since we cannot handle how noise propagates when solving polynomial equations. We leave to future work an in-depth study of the stability of roots of polynomial systems under noise, a fundamental mathematical problem that is unsolved even for homogeneous polynomials. In particular, noisy tensor decomposition approaches combined with zeroth-order optimization may allow for stochastic transitions [HHK+21].

In the online RL setting, we can only show efficient learning under a very strong yet necessary assumption on the feature mapping. We leave to future work identifying more realistic and natural conditions which permit efficient learning in the online RL setting.

Finally, in future work, we hope to consider deep neural networks where parameter recovery or $\ell_{\infty}$ error is unattainable, and deep reinforcement learning with representation learning [YHLD20, DHK+20].

Acknowledgment

JDL acknowledges support of the ARO under MURI Award W911NF-11-1-0303, the Sloan Research Fellowship, NSF CCF 2002272, and an ONR Young Investigator Award. QL is supported by NSF 2030859 and the Computing Research Association for the CIFellows Project. SK acknowledges funding from the NSF Award CCF-1703574 and the ONR award N00014-18-1-2247. BH is supported by the Elite Undergraduate Training Program of School of Mathematical Sciences at Peking University.

References

  • [AOM17] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 263–272, 2017.
  • [AYLSW19] Yasin Abbasi-Yadkori, Nevena Lazic, Csaba Szepesvari, and Gellert Weisz. Exploration-enhanced POLITEX. arXiv preprint arXiv:1908.10479, 2019.
  • [AYPS11] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
  • [AZL19] Zeyuan Allen-Zhu and Yuanzhi Li. What can resnet learn efficiently, going beyond kernels? arXiv preprint arXiv:1905.10337, 2019.
  • [AZLS19] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning (ICML), pages 242–252, 2019.
  • [BBC+19] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemyslaw Debiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, and Chris Hesse. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
  • [BCR13] Jacek Bochnak, Michel Coste, and Marie-Françoise Roy. Real algebraic geometry, volume 36. Springer Science & Business Media, 2013.
  • [BL20] Yu Bai and Jason D. Lee. Beyond linearization: On quadratic and higher-order approximation of wide neural networks. In International Conference on Learning Representations, 2020.
  • [CBL+20] Minshuo Chen, Yu Bai, Jason D Lee, Tuo Zhao, Huan Wang, Caiming Xiong, and Richard Socher. Towards understanding hierarchical learning: Benefits of neural representations. Neural Information Processing Systems (NeurIPS), 2020.
  • [DHK+20] Simon S Du, Wei Hu, Sham M Kakade, Jason D Lee, and Qi Lei. Few-shot learning via learning the representation, provably. arXiv preprint arXiv:2002.09434, 2020.
  • [DJK+18] Christoph Dann, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E. Schapire. On oracle-efficient PAC-RL with rich observations. In Advances in Neural Information Processing Systems, 2018.
  • [DKJ+19] Simon S Du, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal, Miroslav Dudik, and John Langford. Provably efficient RL with rich observations via latent state decoding. In International Conference on Machine Learning, pages 1665–1674, 2019.
  • [DKL+21a] Simon S. Du, Sham M. Kakade, Jason D. Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang. Bilinear classes: A structural framework for provable generalization in rl. arXiv preprint arXiv:2103.10897, 2021.
  • [DKL+21b] Simon S Du, Sham M Kakade, Jason D Lee, Shachar Lovett, Gaurav Mahajan, Wen Sun, and Ruosong Wang. Bilinear classes: A structural framework for provable generalization in RL. arXiv preprint arXiv:2103.10897, 2021.
  • [DKWY20a] Simon S. Du, Sham M. Kakade, Ruosong Wang, and Lin F. Yang. Is a good representation sufficient for sample efficient reinforcement learning? In International Conference on Learning Representations, 2020.
  • [DKWY20b] Simon S Du, Sham M Kakade, Ruosong Wang, and Lin F Yang. Is a good representation sufficient for sample efficient reinforcement learning? In International Conference on Learning Representations, 2020.
  • [DLL+19] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning (ICML), pages 1675–1685. PMLR, 2019.
  • [DLMW20] Simon S Du, Jason D Lee, Gaurav Mahajan, and Ruosong Wang. Agnostic Q-learning with function approximation in deterministic systems: Tight bounds on approximation error and sample complexity. Advances in Neural Information Processing Systems, 2020.
  • [DML21] Alex Damian, Tengyu Ma, and Jason Lee. Label noise sgd provably prefers flat global minimizers. arXiv preprint arXiv:2106.06530, 2021.
  • [DPWZ20] Kefan Dong, Jian Peng, Yining Wang, and Yuan Zhou. $\sqrt{n}$-regret for learning in Markov decision processes with function approximation and low Bellman rank. In Conference on Learning Theory, pages 1554–1557. PMLR, 2020.
  • [DYM21] Kefan Dong, Jiaqi Yang, and Tengyu Ma. Provable model-based nonlinear bandit and reinforcement learning: Shelve optimism, embrace virtual curvature. arXiv preprint arXiv:2102.04168, 2021.
  • [FAD+18] Dylan J. Foster, Alekh Agarwal, Miroslav Dudík, Haipeng Luo, and Robert E. Schapire. Practical contextual bandits with regression oracles. Proceedings of the 35th International Conference on Machine Learning, 2018.
  • [FKR19] Massimo Fornasier, Timo Klock, and Michael Rauchensteiner. Robust and resource efficient identification of two hidden layer neural networks. arXiv preprint arXiv:1907.00485, 2019.
  • [FLYZ20] Cong Fang, Jason D Lee, Pengkun Yang, and Tong Zhang. Modeling from features: a mean-field framework for over-parameterized deep neural networks. arXiv preprint arXiv:2007.01452, 2020.
  • [FVD19] Massimo Fornasier, Jan Vybíral, and Ingrid Daubechies. Robust and resource efficient identification of shallow neural networks by fewest samples. arXiv preprint arXiv:1804.01592, 2019.
  • [GBC16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
  • [GCL+19] Ruiqi Gao, Tianle Cai, Haochuan Li, Liwei Wang, Cho-Jui Hsieh, and Jason D Lee. Convergence of adversarial training in overparametrized networks. Neural Information Processing Systems (NeurIPS), 2019.
  • [GLM17] Rong Ge, Jason D. Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.
  • [GLM18a] Rong Ge, Jason D. Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. In International Conference on Learning Representations (ICLR), 2018.
  • [GLM18b] Rong Ge, Jason D Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. International Conference on Learning Representations (ICLR), 2018.
  • [GMMM19] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Linearized two-layers neural networks in high dimension. arXiv preprint arXiv:1904.12191, 2019.
  • [GMMM21] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Linearized two-layers neural networks in high dimension. The Annals of Statistics, 49(2):1029–1054, 2021.
  • [Han14] Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends® in Machine Learning, 7(2-3):131–309, 2014.
  • [HHK+21] Baihe Huang, Kaixuan Huang, Sham M Kakade, Jason D Lee, Qi Lei, Runzhe Wang, and Jiaqi Yang. Optimal gradient-based algorithms for non-concave bandit optimization. arXiv preprint arXiv:2107.04518, 2021.
  • [HIB+18] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • [HWLM20] Jeff Z. HaoChen, Colin Wei, Jason D. Lee, and Tengyu Ma. Shape matters: Understanding the implicit bias of the noise covariance. arXiv preprint arXiv:2006.08680, 2020.
  • [JAZBJ18] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
  • [JGH18] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems (NeurIPS), pages 8571–8580, 2018.
  • [JKA+17] Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low Bellman rank are PAC-learnable. In Proceedings of the 34th International Conference on Machine Learning, pages 1704–1713, 2017.
  • [JLM21] Chi Jin, Qinghua Liu, and Sobhan Miryoosefi. Bellman eluder dimension: New rich classes of rl problems, and sample-efficient algorithms. arXiv preprint arXiv:2102.00875, 2021.
  • [JOA10] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
  • [JSA15] Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Beating the perils of nonconvexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.
  • [JYWJ20a] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
  • [JYWJ20b] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143, 2020.
  • [KAH+17] Akshay Krishnamurthy, Alekh Agarwal, Tzu-Kuo Huang, Hal Daume III, and John Langford. Active learning for cost-sensitive classification. Proceedings of the 34th International Conference on Machine Learning, 2017.
  • [Kak03] Sham Machandranath Kakade. On the sample complexity of reinforcement learning. PhD thesis, University College London, 2003.
  • [KCL15] Volodymyr Kuleshov, Arun Chaganty, and Percy Liang. Tensor factorization via matrix factorization. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (AISTATS), page 507–516, 2015.
  • [LBH15] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  • [LCYW19] Boyi Liu, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural proximal/trust region policy optimization attains globally optimal policy. arXiv preprint arXiv:1906.10306, 2019.
  • [LHP+15] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • [LKFS21] Gene Li, Pritish Kamath, Dylan J. Foster, and Nathan Srebro. Eluder dimension and generalized rank. arXiv preprint arXiv:2104.06970, 2021.
  • [MGW+20] Edward Moroshko, Suriya Gunasekar, Blake Woodworth, Jason D Lee, Nathan Srebro, and Daniel Soudry. Implicit bias in deep linear classification: Initialization scale vs training accuracy. Neural Information Processing Systems (NeurIPS), 2020.
  • [Mil17] James S. Milne. Algebraic geometry (v6.02), 2017. Available at www.jmilne.org/math/.
  • [MKS+13a] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • [MKS+13b] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • [MPSL21] Dhruv Malik, Aldo Pacchiano, Vishwak Srinivasan, and Yuanzhi Li. Sample efficient reinforcement learning in continuous state spaces: A perspective beyond linearity. arXiv preprint arXiv:2106.07814, 2021.
  • [MSB+17] Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisỳ, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017.
  • [NGL+19] Mor Shpigel Nacson, Suriya Gunasekar, Jason Lee, Nathan Srebro, and Daniel Soudry. Lexicographic and depth-sensitive margins in homogeneous and non-homogeneous deep models. In International Conference on Machine Learning, pages 4683–4692. PMLR, 2019.
  • [ORVR13] Ian Osband, Daniel Russo, and Benjamin Van Roy. (more) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.
  • [RST15a] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning via sequential complexities. Journal of Machine Learning Research, 16(6):155–186, 2015.
  • [RST15b] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Sequential complexities and uniform martingale laws of large numbers. Probability Theory and Related Fields, 161(1-2):111–153, 2015.
  • [Rus19] Daniel Russo. Worst-case regret bounds for exploration via randomized value functions. In Advances in Neural Information Processing Systems, pages 14433–14443, 2019.
  • [RVR13a] Daniel Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic exploration. In Advances in Neural Information Processing Systems, pages 2256–2264, 2013.
  • [RVR13b] Daniel Russo and Benjamin Van Roy. Eluder dimension and the sample complexity of optimistic exploration. In Advances in Neural Information Processing Systems, 2013.
  • [SB18] Richard S Sutton and Andrew G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • [Sha13] Igor R Shafarevich. Basic Algebraic Geometry 1: Varieties in Projective Space. Springer Science & Business Media, 2013.
  • [SHM+16] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, and Marc Lanctot. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
  • [SJK+19] Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based RL in contextual decision processes: PAC bounds and exponential improvements over model-free approaches. In Conference on Learning Theory, pages 2898–2933, 2019.
  • [SLW+06] Alexander L Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L Littman. PAC model-free reinforcement learning. In Proceedings of the 23rd international conference on Machine learning, pages 881–888. ACM, 2006.
  • [SSSS16] Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.
  • [SWW+18] Aaron Sidford, Mengdi Wang, Xian Wu, Lin F Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving discounted Markov decision processes with a generative model. arXiv preprint arXiv:1806.01492, 2018.
  • [WAS20] Gellert Weisz, Philip Amortila, and Csaba Szepesvári. Exponential lower bounds for planning in mdps with linearly-realizable optimal action-value functions. arXiv preprint arXiv:2010.01374, 2020.
  • [WCYW20] Lingxiao Wang, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150, 2020.
  • [WGL+19] Blake Woodworth, Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Kernel and deep regimes in overparametrized models. In Conference on Learning Theory (COLT), 2019.
  • [WLLM19] Colin Wei, Jason D Lee, Qiang Liu, and Tengyu Ma. Regularization matters: Generalization and optimization of neural nets vs their induced kernel. In Advances in Neural Information Processing Systems, pages 9709–9721, 2019.
  • [WSY20a] Ruosong Wang, Ruslan Salakhutdinov, and Lin F Yang. Provably efficient reinforcement learning with general value function approximation. arXiv preprint arXiv:2005.10804, 2020.
  • [WSY20b] Ruosong Wang, Ruslan Salakhutdinov, and Lin F Yang. Provably efficient reinforcement learning with general value function approximation. Advances in Neural Information Processing Systems, 2020.
  • [WVR13] Zheng Wen and Benjamin Van Roy. Efficient exploration and value function generalization in deterministic systems. In Advances in Neural Information Processing Systems, pages 3021–3029, 2013.
  • [WVR17] Zheng Wen and Benjamin Van Roy. Efficient reinforcement learning in deterministic systems with value function generalization. Math. Oper. Res., 42(3):762–782, 2017.
  • [WWDK21] Yining Wang, Ruosong Wang, Simon S. Du, and Akshay Krishnamurthy. Optimism in reinforcement learning with generalized linear function approximation. In International Conference on Learning Representations, 2021.
  • [WWK21] Yuanhao Wang, Ruosong Wang, and Sham M. Kakade. An exponential lower bound for linearly-realizable mdps with constant suboptimality gap. arXiv preprint arXiv:2103.12690, 2021.
  • [WWL+20] Xiang Wang, Chenwei Wu, Jason D Lee, Tengyu Ma, and Rong Ge. Beyond lazy training for over-parameterized tensor decomposition. Neural Information Processing Systems (NeurIPS), 2020.
  • [WX19] Yang Wang and Zhiqiang Xu. Generalized phase retrieval: measurement number, matrix recovery and beyond. Applied and Computational Harmonic Analysis, 47(2):423–446, 2019.
  • [YHLD20] Jiaqi Yang, Wei Hu, Jason D Lee, and Simon S Du. Provable benefits of representation learning in linear bandits. arXiv preprint arXiv:2010.06531, 2020.
  • [YJW+20] Zhuoran Yang, Chi Jin, Zhaoran Wang, Mengdi Wang, and Michael Jordan. Provably efficient reinforcement learning with kernel and neural function approximations. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 13903–13916. Curran Associates, Inc., 2020.
  • [YS19] Gilad Yehudai and Ohad Shamir. On the power and limitations of random features for understanding neural networks. arXiv preprint arXiv:1904.00687, 2019.
  • [YW20] Lin F Yang and Mengdi Wang. Reinforcement leaning in feature space: Matrix bandit, kernels, and regret bound. International Conference on Machine Learning, 2020.
  • [ZB19] Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pages 7304–7312, 2019.
  • [ZCZG18] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep relu networks. arXiv preprint arXiv:1811.08888, 2018.
  • [ZLG20] Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural contextual bandits with ucb-based exploration. Proceedings of the 37th International Conference on Machine Learning, 2020.
  • [ZLKB20] Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, and Emma Brunskill. Learning near optimal policies with low inherent Bellman error. In International Conference on Machine Learning, 2020.
  • [ZSJ+17] Kai Zhong, Zhao Song, Prateek Jain, Peter L. Bartlett, and Inderjit S. Dhillon. Recovery guarantees for one-hidden-layer neural networks. Proceedings of the Thirty-fourth International Conference on Machine Learning (ICML), 70, 2017.

Appendix A Omitted Proofs in Section 3

A.1 Proof of Section 3.1

Theorem A.1 (Formal statement of Theorem 3.4).

Consider an MDP $\mathcal{M}$ where the transition is deterministic. Assume the function class in Definition 3.1 satisfies Assumption 2.1 and Assumption 2.2. For any $t\in(0,1)$, if $d\geq\Omega(\log(B_{W}/\lambda))$ and $n\geq d\cdot\mathrm{poly}(\kappa,k,\lambda,B_{W},B_{\phi},H,\log(d/t))$, then with probability at least $1-t$, Algorithm 1 returns the optimal policy $\pi^{\ast}$.

Proof.

Use $\pi^{\ast}_{1},\dots,\pi^{\ast}_{H}$ to denote the global optimal policy. We prove that Algorithm 1 learns $\pi_{h}^{\ast}$ from $h=H$ down to $h=1$.

At level $H$, the query obtains the exact value $Q_{H}^{\ast}(s,a)$. Therefore, by Theorem A.15, $\widehat{Q}_{H}=Q_{H}^{\ast}$, and thus the optimal planning step finds $\pi_{H}=\pi_{H}^{\ast}$. Now suppose we have learned $\pi_{h+1}^{\ast},\dots,\pi_{H}^{\ast}$ and consider level $h$. Because the transitions are deterministic, the query obtains the exact value $Q_{h}^{\ast}(s,a)$. Therefore, by Theorem A.15, $\widehat{Q}_{h}=Q_{h}^{\ast}$, and thus the optimal planning step finds $\pi_{h}=\pi_{h}^{\ast}$. Applying this argument recursively down to $h=1$ completes the proof. ∎

A.2 Proof of Section 3.2

Theorem A.2 (Formal statement of Theorem 3.5).

Assume the function class in Definition 3.1 satisfies Assumption 2.1, Assumption 2.2, and is policy complete. For any $\epsilon>0$ and $t\in(0,1)$, if $d\geq\Omega(\log(B_{W}B_{\phi}/\epsilon))$ and $n\geq\epsilon^{-2}\cdot d\cdot\mathrm{poly}(\kappa,k,B_{W},B_{\phi},H,\log(d/t))$, then with probability at least $1-t$, Algorithm 2 returns a policy $\pi$ such that $V^{\ast}-V^{\pi}\leq\epsilon$.

Proof.

Use $\pi^{\ast}_{1},\dots,\pi^{\ast}_{H}$ to denote the global optimal policy. We prove that for all $s\in\mathcal{S}$,

\[
V^{\pi^{\ast}_{h},\pi^{\ast}_{h+1},\dots,\pi^{\ast}_{H}}_{h}(s)-V^{\pi_{h},\pi_{h+1},\dots,\pi_{H}}_{h}(s)\leq\frac{(H-h+1)\epsilon}{H}.
\]

At level $H$, let $e_{H}(s^{i}_{H},a^{i}_{H})=r_{H}(s^{i}_{H},a^{i}_{H})-Q^{\ast}_{H}(s^{i}_{H},a^{i}_{H})$; then $e_{H}(s^{i}_{H},a^{i}_{H})=0$. From Theorem A.13, $\widehat{Q}_{H}(s,a):=v_{H}^{\top}\sigma(W_{H}\phi(s,a))$ satisfies $|\widehat{Q}_{H}(s,a)-Q^{\ast}_{H}(s,a)|\leq\frac{\epsilon}{2H}$ for all $s\in\mathcal{S},a\in\mathcal{A}$. Therefore, for all $s\in\mathcal{S}$,

\[
\begin{aligned}
V_{H}^{\ast}(s)-V^{\pi_{H}}_{H}(s)={}&\mathbb{E}_{a\sim\pi^{\ast}_{H}}[Q^{\ast}_{H}(s,a)]-\mathbb{E}_{a\sim\pi^{\ast}_{H}}[\widehat{Q}_{H}(s,a)]\\
&+\mathbb{E}_{a\sim\pi^{\ast}_{H}}[\widehat{Q}_{H}(s,a)]-\mathbb{E}_{a\sim\pi_{H}}[\widehat{Q}_{H}(s,a)]\\
&+\mathbb{E}_{a\sim\pi_{H}}[\widehat{Q}_{H}(s,a)]-\mathbb{E}_{a\sim\pi_{H}}[Q^{\ast}_{H}(s,a)]\\
\leq{}&\frac{\epsilon}{H},
\end{aligned}
\]

where in the second step we used $\mathbb{E}_{a\sim\pi^{\ast}_{H}}[\widehat{Q}_{H}(s,a)]\leq\mathbb{E}_{a\sim\pi_{H}}[\widehat{Q}_{H}(s,a)]$ by optimality of $\pi_{H}$ and $|\widehat{Q}_{H}(s,a)-Q^{\ast}_{H}(s,a)|\leq\frac{\epsilon}{2H}$.

Suppose we have learned policies $\pi_{h+1},\dots,\pi_{H}$. We use $\widetilde{\pi}_{h}$ to denote the optimal policy with respect to $Q^{\pi_{h+1},\dots,\pi_{H}}_{h}(s,a)$. Let

\[
e_{h}(s^{i}_{h},a^{i}_{h})=\widehat{Q}_{h}^{i}-Q_{h}^{\pi_{h+1},\dots,\pi_{H}}(s^{i}_{h},a^{i}_{h});
\]

then $e_{h}(s^{i}_{h},a^{i}_{h})$ is zero-mean and $H^{2}$-sub-Gaussian (notice that $\widehat{Q}_{h}^{i}$ is an unbiased estimate of $Q_{h}^{\pi_{h+1},\dots,\pi_{H}}(s^{i}_{h},a^{i}_{h})$ and $\widehat{Q}_{h}^{i}\leq O(H)$). From Theorem A.13, $\widehat{Q}_{h}(s,a)=v_{h}^{\top}\sigma(W_{h}\phi(s,a))$ satisfies $|\widehat{Q}_{h}(s,a)-Q_{h}^{\pi_{h+1},\dots,\pi_{H}}(s,a)|\leq\frac{\epsilon}{2H}$ for all $s\in\mathcal{S},a\in\mathcal{A}$. Therefore, for all $s\in\mathcal{S}$,

\[
\begin{aligned}
&V^{\widetilde{\pi}_{h},\pi_{h+1},\dots,\pi_{H}}_{h}(s)-V^{\pi_{h},\pi_{h+1},\dots,\pi_{H}}_{h}(s)\\
={}&\mathbb{E}_{a\sim\widetilde{\pi}_{h}}[Q_{h}^{\pi_{h+1},\dots,\pi_{H}}(s,a)]-\mathbb{E}_{a\sim\widetilde{\pi}_{h}}[\widehat{Q}_{h}(s,a)]\\
&+\mathbb{E}_{a\sim\widetilde{\pi}_{h}}[\widehat{Q}_{h}(s,a)]-\mathbb{E}_{a\sim\pi_{h}}[\widehat{Q}_{h}(s,a)]\\
&+\mathbb{E}_{a\sim\pi_{h}}[\widehat{Q}_{h}(s,a)]-\mathbb{E}_{a\sim\pi_{h}}[Q_{h}^{\pi_{h+1},\dots,\pi_{H}}(s,a)]\\
\leq{}&\frac{\epsilon}{H},
\end{aligned}
\]

where in the second step we used $\mathbb{E}_{a\sim\widetilde{\pi}_{h}}[\widehat{Q}_{h}(s,a)]\leq\mathbb{E}_{a\sim\pi_{h}}[\widehat{Q}_{h}(s,a)]$ by optimality of $\pi_{h}$ and $|\widehat{Q}_{h}(s,a)-Q_{h}^{\pi_{h+1},\dots,\pi_{H}}(s,a)|\leq\frac{\epsilon}{2H}$.

It thus follows that

\[
\begin{aligned}
V^{\pi^{\ast}_{h},\pi^{\ast}_{h+1},\dots,\pi^{\ast}_{H}}_{h}(s)-V^{\pi_{h},\pi_{h+1},\dots,\pi_{H}}_{h}(s)={}&V^{\pi^{\ast}_{h},\pi^{\ast}_{h+1},\dots,\pi^{\ast}_{H}}_{h}(s)-V^{\pi^{\ast}_{h},\pi_{h+1},\dots,\pi_{H}}_{h}(s)\\
&+V^{\pi^{\ast}_{h},\pi_{h+1},\dots,\pi_{H}}_{h}(s)-V^{\widetilde{\pi}_{h},\pi_{h+1},\dots,\pi_{H}}_{h}(s)\\
&+V^{\widetilde{\pi}_{h},\pi_{h+1},\dots,\pi_{H}}_{h}(s)-V^{\pi_{h},\pi_{h+1},\dots,\pi_{H}}_{h}(s)\\
\leq{}&V^{\pi^{\ast}_{h},\pi^{\ast}_{h+1},\dots,\pi^{\ast}_{H}}_{h}(s)-V^{\pi^{\ast}_{h},\pi_{h+1},\dots,\pi_{H}}_{h}(s)+\frac{\epsilon}{H}\\
\leq{}&\cdots\\
\leq{}&\frac{(H-h+1)\epsilon}{H},
\end{aligned}
\]

where in the second step we use $V^{\pi^{\ast}_{h},\pi_{h+1},\dots,\pi_{H}}_{h}(s)\leq V^{\widetilde{\pi}_{h},\pi_{h+1},\dots,\pi_{H}}_{h}(s)$, which follows from the optimality of $\widetilde{\pi}_{h}$, together with the bound $V^{\widetilde{\pi}_{h},\pi_{h+1},\dots,\pi_{H}}_{h}(s)-V^{\pi_{h},\pi_{h+1},\dots,\pi_{H}}_{h}(s)\leq\frac{\epsilon}{H}$ established above; the remaining term equals $\mathbb{E}_{a\sim\pi^{\ast}_{h}}\mathbb{E}_{s^{\prime}}[V^{\pi^{\ast}_{h+1},\dots,\pi^{\ast}_{H}}_{h+1}(s^{\prime})-V^{\pi_{h+1},\dots,\pi_{H}}_{h+1}(s^{\prime})]$ and is at most $\frac{(H-h)\epsilon}{H}$ by the induction hypothesis. Repeating this argument down to $h=1$ completes the proof. ∎

A.3 Proof of Section 3.3

Theorem A.3 (Formal statement of Theorem 3.6).

Assume the function class in Definition 3.1 satisfies Assumption 2.1, Assumption 2.2, and is Bellman complete. For any $\epsilon>0$ and $t\in(0,1)$, if $d\geq\Omega(\log(B_{W}B_{\phi}/\epsilon))$ and $n\geq\epsilon^{-2}\cdot d\cdot\mathrm{poly}(\kappa,k,B_{W},B_{\phi},H,\log(d/t))$, then with probability at least $1-t$, Algorithm 3 returns a policy $\pi$ such that $V^{\ast}-V^{\pi}\leq\epsilon$.

Proof.

Use $\pi^{\ast}_{1},\dots,\pi^{\ast}_{H}$ to denote the global optimal policy. We prove that

\[
|\widehat{Q}_{h}(s,a)-Q^{\ast}_{h}(s,a)|\leq\frac{(H-h+1)\epsilon}{H} \tag{1}
\]

for all $s\in\mathcal{S},a\in\mathcal{A}$.

At level $H$, let

\[
e_{H}(s^{i}_{H},a^{i}_{H})=r_{H}(s^{i}_{H},a^{i}_{H})-Q^{\ast}_{H}(s^{i}_{H},a^{i}_{H});
\]

then $e_{H}(s^{i}_{H},a^{i}_{H})=0$. From Theorem A.13, $\widehat{Q}_{H}(s,a):=v_{H}^{\top}\sigma(W_{H}\phi(s,a))$ satisfies $|\widehat{Q}_{H}(s,a)-Q^{\ast}_{H}(s,a)|\leq\frac{\epsilon}{H}$ for all $s\in\mathcal{S},a\in\mathcal{A}$.

Suppose we have learned $\widehat{Q}_{h+1}(s,a)$ with $|\widehat{Q}_{h+1}(s,a)-Q^{\ast}_{h+1}(s,a)|\leq\frac{(H-h)\epsilon}{H}$. At level $h$, let

\[
e_{h}(s^{i}_{h},a^{i}_{h})=r_{h}(s^{i}_{h},a^{i}_{h})+\widehat{V}_{h+1}(s_{h+1}^{i})-{\cal T}_{h}(\widehat{Q}_{h+1})(s^{i}_{h},a^{i}_{h});
\]

then $e_{h}(s^{i}_{h},a^{i}_{h})$ is zero-mean and $H^{2}$-sub-Gaussian (notice that $r_{h}(s^{i}_{h},a^{i}_{h})+\widehat{V}_{h+1}(s_{h+1}^{i})$ is an unbiased estimate of ${\cal T}_{h}(\widehat{Q}_{h+1})(s^{i}_{h},a^{i}_{h})$, and $r_{h}(s^{i}_{h},a^{i}_{h})+\widehat{V}_{h+1}(s_{h+1}^{i})\leq O(H)$). From Theorem A.13, $\widehat{Q}_{h}(s,a):=v_{h}^{\top}\sigma(W_{h}\phi(s,a))$ satisfies $|\widehat{Q}_{h}(s,a)-{\cal T}_{h}(\widehat{Q}_{h+1})(s,a)|\leq\frac{\epsilon}{H}$ for all $s\in\mathcal{S},a\in\mathcal{A}$. Therefore,

\[
\begin{aligned}
|\widehat{Q}_{h}(s,a)-Q^{\ast}_{h}(s,a)|\leq{}&|\widehat{Q}_{h}(s,a)-{\cal T}_{h}(\widehat{Q}_{h+1})(s,a)|+|{\cal T}_{h}(\widehat{Q}_{h+1})(s,a)-Q^{\ast}_{h}(s,a)|\\
\leq{}&\frac{\epsilon}{H}+\max_{s\in\mathcal{S},a\in\mathcal{A}}|\widehat{Q}_{h+1}(s,a)-Q^{\ast}_{h+1}(s,a)|\\
\leq{}&\frac{(H-h+1)\epsilon}{H}
\end{aligned}
\]

holds for all $s\in\mathcal{S},a\in\mathcal{A}$.

It thus follows that for all $s_{1}\in\mathcal{S}$,

\[
\begin{aligned}
V^{\pi^{\ast}_{1},\dots,\pi^{\ast}_{H}}_{1}(s_{1})-V^{\pi_{1},\dots,\pi_{H}}_{1}(s_{1})={}&\mathbb{E}_{a\sim\pi^{\ast}_{1}}[Q_{1}^{\ast}(s_{1},a)]-\mathbb{E}_{a\sim\pi_{1}}[Q_{1}^{\pi_{2},\dots,\pi_{H}}(s_{1},a)]\\
\leq{}&\mathbb{E}_{a\sim\pi^{\ast}_{1}}[\widehat{Q}_{1}(s_{1},a)]-\mathbb{E}_{a\sim\pi_{1}}[Q_{1}^{\pi_{2},\dots,\pi_{H}}(s_{1},a)]+\epsilon\\
\leq{}&\mathbb{E}_{a\sim\pi_{1}}[\widehat{Q}_{1}(s_{1},a)-Q_{1}^{\pi_{2},\dots,\pi_{H}}(s_{1},a)]+\epsilon\\
\leq{}&\mathbb{E}_{a\sim\pi_{1}}[Q_{1}^{\ast}(s_{1},a)-Q_{1}^{\pi_{2},\dots,\pi_{H}}(s_{1},a)]+2\epsilon\\
\leq{}&\mathbb{E}_{a\sim\pi_{1}}\mathbb{E}_{s_{2}\sim\mathbb{P}(\cdot|s_{1},a)}[V_{2}^{\pi^{\ast}_{2},\dots,\pi^{\ast}_{H}}(s_{2})-V_{2}^{\pi_{2},\dots,\pi_{H}}(s_{2})]+2\epsilon\\
\leq{}&\cdots\\
\leq{}&2H\epsilon,
\end{aligned}
\]

where the first step comes from the definition of the value function; the second step comes from Eq. (1); the third step comes from the optimality of $\pi_{1}$ under $\widehat{Q}_{1}$; the fourth step comes from Eq. (1); and the fifth step comes from the Bellman equation. The proof is completed by rescaling $\epsilon\leftarrow\epsilon/(2H)$. ∎

A.4 Proof of Section 3.4

Under the gap condition, either Algorithm 2 or Algorithm 3 works as long as we select $\epsilon\approx\rho$. The following algorithm is an adaptation of Algorithm 2.

Algorithm 6 Learning realizable $Q^{\ast}$ with optimality gap
1: for $h=H,\ldots,1$ do
2:     Sample $x^{i}_{h}$, $i\in[n]$, from the standard Gaussian $\mathcal{N}(0,I_{d})$
3:     for $i\in[n]$ do
4:         if $\|x^{i}_{h}\|\leq\delta_{\phi}$ then
5:             Find $(s^{i}_{h},a^{i}_{h})\in\phi^{-1}(x^{i}_{h})$ and locate the state $s^{i}_{h}$ in the generative model
6:             Pull action $a^{i}_{h}$ and use $\pi_{h+1},\dots,\pi_{H}$ as the roll-out to collect rewards $r_{h}^{(i)},\dots,r_{H}^{(i)}$
7:             Construct an unbiased estimate of $Q_{h}^{\pi_{h+1},\dots,\pi_{H}}(s^{i}_{h},a^{i}_{h})$:
               $\widehat{Q}_{h}^{i}\leftarrow r_{h}^{(i)}+\cdots+r_{H}^{(i)}$
8:         else
9:             Let $\widehat{Q}_{h}^{i}\leftarrow 0$
10:        end if
11:    end for
12:    Compute $(v_{h},W_{h})\leftarrow\textsc{NeuralNetNoisyRecovery}(\{(x^{i}_{h},\widehat{Q}_{h}^{i}):i\in[n]\})$
13:    Set $\widehat{Q}_{h}(s,a)\leftarrow v_{h}^{\top}\sigma(W_{h}\phi(s,a))$
14:    Let $\pi_{h}(s)\leftarrow\operatorname{arg\,max}_{a\in\mathcal{A}}\widehat{Q}_{h}(s,a)$
15: end for
16: Return $\pi_{1},\dots,\pi_{H}$
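For concreteness, Lines 6–7 of Algorithm 6 amount to a single Monte Carlo roll-out. The following minimal sketch illustrates this; the environment interface `env.step` and the `policies` container are hypothetical placeholders and not part of the formal algorithm.

```python
def rollout_Q_estimate(env, s, a, h, H, policies):
    """Unbiased Monte Carlo estimate of Q_h^{pi_{h+1},...,pi_H}(s, a):
    pull action a at (s, h), then follow the already-learned policies
    pi_{h+1}, ..., pi_H and sum the collected rewards (Lines 6-7 of Algorithm 6).
    `env.step(level, state, action) -> (next_state, reward)` is an assumed interface."""
    state, reward = env.step(h, s, a)        # pull action a at level h
    total = reward
    for t in range(h + 1, H + 1):            # roll out with pi_{h+1}, ..., pi_H
        action = policies[t](state)
        state, r = env.step(t, state, action)
        total += r
    return total                             # equals r_h + ... + r_H
```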
Theorem A.4 (Formal statement of Theorem 3.8).

Assume the function class in Definition 3.1 satisfies Assumption 2.1 and Assumption 2.2. Suppose $\rho>0$ and $d\geq\Omega(\log(B_{W}B_{\phi}/\rho))$. For any $t\in(0,1)$, if $n=\frac{d}{\rho^{2}}\cdot\mathrm{poly}(\kappa,k,B_{W},B_{\phi},H,\log(d/t))$, then with probability at least $1-t$, Algorithm 6 returns the optimal policy $\pi^{\ast}$.

Proof.

Use $\pi^{\ast}_{1},\dots,\pi^{\ast}_{H}$ to denote the global optimal policy. As in Theorem A.1, we prove that Algorithm 6 learns $\pi_{h}^{\ast}$ from $h=H$ down to $h=1$.

At level $H$, the algorithm uses $n=\frac{d}{\rho^{2}}\cdot\mathrm{poly}(\kappa,k,\log d,B_{W},B_{\phi},H,\log(1/t))$ trajectories to obtain $\widehat{Q}_{H}$ such that $|\widehat{Q}_{H}(s,a)-Q^{\ast}_{H}(s,a)|\leq\rho/4$ by Theorem A.13. Therefore,

\[
\begin{aligned}
&V^{\ast}_{H}(s)-Q^{\ast}_{H}(s,\pi_{H}(s))\\
\leq{}&Q^{\ast}_{H}(s,\pi^{\ast}_{H}(s))-Q^{\ast}_{H}(s,\pi_{H}(s))\\
\leq{}&Q^{\ast}_{H}(s,\pi^{\ast}_{H}(s))-\widehat{Q}_{H}(s,\pi^{\ast}_{H}(s))+\widehat{Q}_{H}(s,\pi^{\ast}_{H}(s))-\widehat{Q}_{H}(s,\pi_{H}(s))+\widehat{Q}_{H}(s,\pi_{H}(s))-Q^{\ast}_{H}(s,\pi_{H}(s))\\
\leq{}&\rho/2,
\end{aligned}
\]

where the last inequality uses $\widehat{Q}_{H}(s,\pi^{\ast}_{H}(s))\leq\widehat{Q}_{H}(s,\pi_{H}(s))$, i.e., the optimality of $\pi_{H}(s)$ under $\widehat{Q}_{H}$, together with the $\rho/4$ accuracy of $\widehat{Q}_{H}$. Thus Definition 3.7 gives $\pi_{H}(s)=\pi^{\ast}_{H}(s)$. Now suppose we have learned $\pi_{h+1}^{\ast},\dots,\pi_{H}^{\ast}$ and consider level $h$. The same argument yields $\pi_{h}=\pi^{\ast}_{h}$. Applying this recursively down to $h=1$ completes the proof. ∎

A.5 Neural network recovery

This section considers recovering a neural network $\langle v,\sigma(Wx)\rangle$ from the following two observation models, where $B=\Omega(d\cdot\mathrm{poly}\log(d))$; a small sampling sketch is given after the list.

  • Noisy samples from

    \[
    x\sim\mathcal{N}(0,I_{d}),\qquad y=(\langle v,\sigma(Wx)\rangle+\xi)\cdot\mathbbm{1}(\|x\|\leq B), \tag{2}
    \]

    where $\xi$ is $\vartheta$-sub-Gaussian noise.

  • Noiseless samples from

    \[
    x\sim\mathcal{N}(0,I_{d}),\qquad y=\langle v,\sigma(Wx)\rangle\cdot\mathbbm{1}(\|x\|\leq B). \tag{3}
    \]
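For concreteness, the following minimal sketch (not part of the algorithm; it fixes the ReLU activation, which is one admissible homogeneous choice, and Gaussian noise, which is $\vartheta$-sub-Gaussian) draws samples from the noisy model in Eq. (2); setting the noise level to zero recovers the noiseless model in Eq. (3).

```python
import numpy as np

def sample_truncated_model(n, d, W, v, vartheta, B, rng=None):
    """Draw n samples from x ~ N(0, I_d), y = (<v, sigma(Wx)> + xi) * 1{||x|| <= B},
    with sigma = ReLU and xi ~ N(0, vartheta^2) standing in for sub-Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.standard_normal((n, d))             # rows are x_i ~ N(0, I_d)
    act = np.maximum(X @ W.T, 0.0)              # sigma(W x_i), shape (n, k)
    xi = vartheta * rng.standard_normal(n)      # noise, independent of x_i
    keep = np.linalg.norm(X, axis=1) <= B       # truncation indicator 1{||x|| <= B}
    y = (act @ v + xi) * keep
    return X, y
```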

Recovering a neural network from data has received comprehensive study in deep learning theory [JSA15, ZSJ+17, GLM17]. The analysis in this section is mainly based on the method of moments in [ZSJ+17]. However, the learning tasks above differ from those considered in [ZSJ+17] because of the presence of noise and the truncation of the signal, so additional care is needed in the analysis.

We consider more general homogeneous activation functions, specified by the assumptions that follow. Since the activation function is homogeneous, we assume $v_{i}\in\{\pm 1\}$ in the following without loss of generality.

Assumption A.5 (Property 3.1 of [ZSJ+17]).

Assume $\sigma^{\prime}(x)$ is nonnegative and homogeneously bounded, i.e., $0\leq\sigma^{\prime}(x)\leq L_{1}|x|^{p}$ for some constants $L_{1}>0$ and $p\geq 0$.

Definition A.6 (Part of property 3.2 of [ZSJ+17]).

Define $\rho(z):=\min\{\beta_{0}(z)-\alpha_{0}^{2}(z)-\alpha_{1}^{2}(z),\;\beta_{2}(z)-\alpha_{1}^{2}(z)-\alpha_{2}^{2}(z),\;\alpha_{0}(z)\alpha_{2}(z)-\alpha_{1}^{2}(z)\}$, where $\alpha_{q}(z):=\mathbb{E}_{x\sim\mathcal{N}(0,1)}[\sigma^{\prime}(zx)x^{q}]$ for $q\in\{0,1,2\}$, and $\beta_{q}(z):=\mathbb{E}_{x\sim\mathcal{N}(0,1)}[(\sigma^{\prime})^{2}(zx)x^{q}]$ for $q\in\{0,2\}$.

Assumption A.7 (Part of property 3.2 of [ZSJ+17]).

The first derivative $\sigma^{\prime}(z)$ satisfies $\rho(z)>0$ for all $z>0$.

Assumption A.8 (Property 3.3 of [ZSJ+17]).

The second derivative $\sigma^{\prime\prime}(x)$ is either (a) globally bounded or (b) $\sigma^{\prime\prime}(x)=0$ except at finitely many points.

Notice that ReLU, squared ReLU, leaky ReLU, and polynomial activation functions all satisfy the above assumptions. We also make the following assumption on the dimension of the feature vectors, which governs how much information the features can extract about the neural network from noisy samples: the dimension only needs to exceed a term logarithmic in $1/\epsilon$ and in the norm of the parameters.

Assumption A.9 (Rich feature).

Assume $d\geq\Omega(\log(B_{W}/\epsilon))$.

First we introduce a notation from [ZSJ+17].

Definition A.10.

Define the outer product $\tilde{\otimes}$ as follows. For a vector $v\in\mathbb{R}^{d}$ and the identity matrix $I\in\mathbb{R}^{d\times d}$,

\[
v\tilde{\otimes}I=\sum_{j=1}^{d}\left[v\otimes e_{j}\otimes e_{j}+e_{j}\otimes v\otimes e_{j}+e_{j}\otimes e_{j}\otimes v\right].
\]

For a symmetric rank-$r$ matrix $M=\sum_{i=1}^{r}s_{i}v_{i}v_{i}^{\top}$ and the identity matrix $I\in\mathbb{R}^{d\times d}$,

\[
M\tilde{\otimes}I=\sum_{i=1}^{r}s_{i}\sum_{j=1}^{d}\sum_{l=1}^{6}A_{l,i,j},
\]

where $A_{1,i,j}=v_{i}\otimes v_{i}\otimes e_{j}\otimes e_{j}$, $A_{2,i,j}=v_{i}\otimes e_{j}\otimes v_{i}\otimes e_{j}$, $A_{3,i,j}=e_{j}\otimes v_{i}\otimes v_{i}\otimes e_{j}$, $A_{4,i,j}=v_{i}\otimes e_{j}\otimes e_{j}\otimes v_{i}$, $A_{5,i,j}=e_{j}\otimes v_{i}\otimes e_{j}\otimes v_{i}$, $A_{6,i,j}=e_{j}\otimes e_{j}\otimes v_{i}\otimes v_{i}$.
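As a concrete illustration of the vector case of this definition, the following small numpy sketch (the helper name is ours) builds the three-way tensor $v\tilde{\otimes}I$; the matrix case $M\tilde{\otimes}I$ is analogous with one additional index.

```python
import numpy as np

def vec_outer_tilde_I(v):
    """Compute v ~otimes I = sum_j (v⊗e_j⊗e_j + e_j⊗v⊗e_j + e_j⊗e_j⊗v) in R^{d×d×d}."""
    d = v.shape[0]
    I = np.eye(d)
    T = np.einsum('i,jk->ijk', v, I)      # sum_j v ⊗ e_j ⊗ e_j  (entries v_i * delta_{jk})
    T = T + np.einsum('j,ik->ijk', v, I)  # sum_j e_j ⊗ v ⊗ e_j
    T = T + np.einsum('k,ij->ijk', v, I)  # sum_j e_j ⊗ e_j ⊗ v
    return T
```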

Now we define some moments.

Definition A.11.

Define $M_{1},M_{2},M_{3},M_{4}$ and $m_{1,i},m_{2,i},m_{3,i},m_{4,i}$ as follows:

\[
\begin{aligned}
M_{1}&:=\mathbb{E}[y\cdot x],\\
M_{2}&:=\mathbb{E}[y\cdot(x\otimes x-I)],\\
M_{3}&:=\mathbb{E}[y\cdot(x^{\otimes 3}-x\tilde{\otimes}I)],\\
M_{4}&:=\mathbb{E}[y\cdot(x^{\otimes 4}-(x\otimes x)\tilde{\otimes}I+I\tilde{\otimes}I)],\\
\gamma_{j}(x)&:=\mathbb{E}_{z\sim\mathcal{N}(0,1)}[\sigma(x\cdot z)z^{j}],\quad\forall j\in\{0,1,2,3,4\},\\
m_{1,i}&:=\gamma_{1}(\|w_{i}\|),\\
m_{2,i}&:=\gamma_{2}(\|w_{i}\|)-\gamma_{0}(\|w_{i}\|),\\
m_{3,i}&:=\gamma_{3}(\|w_{i}\|)-3\gamma_{1}(\|w_{i}\|),\\
m_{4,i}&:=\gamma_{4}(\|w_{i}\|)+3\gamma_{0}(\|w_{i}\|)-6\gamma_{2}(\|w_{i}\|).
\end{aligned}
\]

The above expectations are all with respect to $x\sim\mathcal{N}(0,I_{d})$ and $y=\langle v,\sigma(Wx)\rangle$.
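As an illustration of how these population moments are used, the sketch below (the helper name is ours) forms the empirical counterpart of $M_{2}$ from samples; for activations with $M_{2}\neq 0$ (e.g., ReLU), this is exactly the kind of matrix $\widehat{P}_{2}$ that Algorithm 7 computes in Line 4.

```python
import numpy as np

def empirical_M2(X, y):
    """Empirical estimate of M2 = E[y (x ⊗ x - I)] from n samples:
    rows of X are the x_i and y holds the corresponding labels."""
    n, d = X.shape
    M2_hat = np.einsum('n,ni,nj->ij', y, X, X) / n - y.mean() * np.eye(d)
    return 0.5 * (M2_hat + M2_hat.T)   # symmetrize against sampling noise
```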

Assumption A.12 (Assumption 5.3 of [ZSJ+17]).

Assume the activation function satisfies the following:

  • If $M_{j}\neq 0$, then $m_{j,i}\neq 0$ for all $i\in[k]$.

  • At least one of $M_{3}$ and $M_{4}$ is nonzero.

  • If $M_{1}=M_{3}=0$, then $\sigma(z)$ is an even function.

  • If $M_{2}=M_{4}=0$, then $\sigma(z)$ is an odd function.

Now we state the theoretical result that recovers neural networks from noisy data.

Theorem A.13 (Neural network recovery from noisy data).

Let the activation function $\sigma$ satisfy Assumption A.5 and Assumption A.12, and let $\kappa$ be the condition number of $W$. Given $n$ samples from Eq. (2), for any $t\in(0,1)$ and $\epsilon\in(0,1)$ such that Assumption A.9 holds, if

\[
n\geq\epsilon^{-2}\cdot d\cdot\mathrm{poly}(\kappa,k,\vartheta,\log(d/t)),
\]

then there exists an algorithm that takes $\tilde{O}(nkd)$ time and returns a matrix $\widehat{W}\in\mathbb{R}^{k\times d}$ and a vector $\widehat{v}\in\{\pm 1\}^{k}$ such that, with probability at least $1-t$,

\[
\|\widehat{W}-W\|_{F}\leq\epsilon\cdot\mathrm{poly}(k,\kappa)\cdot\|W\|_{F},\qquad\text{and}\qquad\widehat{v}=v.
\]

The algorithm and proof are given in Appendix A.5.1. By Assumption A.5, the following corollary is immediate.

Corollary A.14.

Consider the same setting as Theorem A.13, suppose $\|W\|_{F}\leq B_{W}$ and Assumption A.9 holds, and let $t\in(0,1)$. Given $n$ samples from Eq. (2), if

\[
n\geq\epsilon^{-2}\cdot d\cdot\mathrm{poly}(\kappa,k,\log(d/t),B_{W},B_{\phi},\vartheta),
\]

then there exists an algorithm that takes $\tilde{O}(nkd)$ time and outputs a matrix $\widehat{W}\in\mathbb{R}^{k\times d}$ and a vector $\widehat{v}\in\{\pm 1\}^{k}$ such that, with probability at least $1-t$, for all $\|x\|_{2}\leq B_{\phi}$,

\[
|\langle\widehat{v},\sigma(\widehat{W}x)\rangle-\langle v,\sigma(Wx)\rangle|\leq\epsilon.
\]

In particular, when $B_{\phi}=O(d\cdot\mathrm{poly}\log d)$ the following sample complexity suffices:

\[
n\geq\epsilon^{-2}\cdot d^{O(1+p)}\cdot\mathrm{poly}(\kappa,k,\log(d/t),B_{W},\vartheta).
\]

Now we state the theoretical result that precisely recovers neural networks from noiseless data. The proof and method are shown in Appendix A.5.2.

Theorem A.15 (Exact neural network recovery from noiseless data).

Let the activation function satisfy Assumption A.5, Assumption A.12, Assumption A.7, and Assumption A.8(b). Given $n$ samples from Eq. (3), for any $t\in(0,1)$, suppose $d\geq\Omega(\log(B_{W}/\lambda))$ and

\[
n\geq d\cdot\mathrm{poly}(\kappa,k,\lambda,\log(d/t)).
\]

Then there exists an algorithm that outputs the exact $W$ and $v$ with probability at least $1-t$.

A.5.1 Recover neural networks from noisy data

In this section we prove Theorem A.13. Denote $W=[w_{1},\cdots,w_{k}]^{\top}$, where $w_{i}\in\mathbb{R}^{d}$ and $\bar{w}_{i}=w_{i}/\|w_{i}\|_{2}$.

Definition A.16.

Given a vector $\alpha\in\mathbb{R}^{d}$, define $P_{2}:=M_{j_{2}}(I,I,\alpha,\cdots,\alpha)$, where $j_{2}=\min\{j\geq 2:M_{j}\neq 0\}$, and $P_{3}:=M_{j_{3}}(I,I,I,\alpha,\cdots,\alpha)$, where $j_{3}=\min\{j\geq 3:M_{j}\neq 0\}$.

The method of moments is presented in Algorithm 7. Here we sketch its ideas and refer readers to [ZSJ+17] for a thorough explanation. There are three main steps. In the first step, the algorithm computes the span of the rows of $W$: by the power method, Line 7 finds the top-$k$ eigenvalues of $CI+\widehat{P}_{2}$ and $CI-\widehat{P}_{2}$; it then picks the largest $k$ eigenvalues across the two matrices by invoking TopK in Line 15; finally it orthogonalizes the corresponding eigenvectors in Line 19 and obtains an orthogonal matrix $V$ in the subspace spanned by $\{\bar{w}_{1},\dots,\bar{w}_{k}\}$.
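This subspace-estimation step can be summarized by the following sketch (illustrative only; it assumes the simple case in which the $k$ relevant eigenvalues of $\widehat{P}_{2}$ all have the same sign, so only one of the two shifted matrices is needed, whereas Algorithm 7 handles both $CI+\widehat{P}_{2}$ and $CI-\widehat{P}_{2}$ and merges the results via TopK).

```python
import numpy as np

def estimate_row_span(P2_hat, k, T=50, rng=None):
    """Orthogonal (power) iteration on C*I + P2_hat, returning a d x k orthonormal
    matrix V whose columns approximately span {w_1, ..., w_k}."""
    rng = np.random.default_rng() if rng is None else rng
    d = P2_hat.shape[0]
    C = 3.0 * np.linalg.norm(P2_hat, 2)      # shift so the iterated matrix is PSD
    A = C * np.eye(d) + P2_hat
    V, _ = np.linalg.qr(rng.standard_normal((d, k)))
    for _ in range(T):
        V, _ = np.linalg.qr(A @ V)           # one power-iteration step followed by QR
    return V
```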

In the second step, the algorithm forms the third-order tensor $R_{3}=P_{3}(V,V,V)\in\mathbb{R}^{k\times k\times k}$ and uses the robust tensor decomposition method in [KCL15] to find $\widehat{u}_{i}$ that approximate $s_{i}V^{\top}\bar{w}_{i}$ with unknown signs $s_{i}$. In the third step, the algorithm determines $s$, $v$, and $w_{i}$, $i\in[k]$. Since the activation function is homogeneous, we assume $v_{i}\in\{\pm 1\}$ and $m_{j,i}=c_{j}\|w_{i}\|^{p+1}$ for universal constants $c_{j}$ without loss of generality. For illustration, we define $Q_{1}$ and $Q_{2}$ as follows:

\[
Q_{1}=M_{l_{1}}(I,\underbrace{\alpha,\cdots,\alpha}_{(l_{1}-1)\ \alpha\text{'s}})=\sum_{i=1}^{k}v_{i}c_{l_{1}}\|w_{i}\|^{p+1}(\alpha^{\top}\overline{w}_{i})^{l_{1}-1}\overline{w}_{i}, \tag{4}
\]
\[
Q_{2}=M_{l_{2}}(V,V,\underbrace{\alpha,\cdots,\alpha}_{(l_{2}-2)\ \alpha\text{'s}})=\sum_{i=1}^{k}v_{i}c_{l_{2}}\|w_{i}\|^{p+1}(\alpha^{\top}\overline{w}_{i})^{l_{2}-2}(V^{\top}\overline{w}_{i})(V^{\top}\overline{w}_{i})^{\top}, \tag{5}
\]

where $l_{1}\geq 1$ with $M_{l_{1}}\neq 0$ and $l_{2}\geq 2$ with $M_{l_{2}}\neq 0$ are specified later. Then the solutions of the linear systems

\[
z^{*}=\operatorname*{arg\,min}_{z\in\mathbb{R}^{k}}\Big\|\sum_{i=1}^{k}z_{i}s_{i}\overline{w}_{i}-Q_{1}\Big\|,\qquad r^{*}=\operatorname*{arg\,min}_{r\in\mathbb{R}^{k}}\Big\|\sum_{i=1}^{k}r_{i}V^{\top}\overline{w}_{i}(V^{\top}\overline{w}_{i})^{\top}-Q_{2}\Big\|_{F} \tag{6}
\]

are given by

\[
z^{*}_{i}=v_{i}s_{i}^{l_{1}}c_{l_{1}}\|w_{i}\|^{p+1}(\alpha^{\top}s_{i}\overline{w}_{i})^{l_{1}-1},\qquad r^{*}_{i}=v_{i}s_{i}^{l_{2}}c_{l_{2}}\|w_{i}\|^{p+1}(\alpha^{\top}s_{i}\overline{w}_{i})^{l_{2}-2}.
\]

When $c_{l_{1}}$ and $c_{l_{2}}$ do not have the same sign, we can recover $v_{i}$ and $s_{i}$ via $v_{i}=\operatorname{sign}(r^{\ast}_{i}c_{l_{2}})$ and $s_{i}=\operatorname{sign}(v_{i}z^{\ast}_{i}c_{l_{1}})$, and recover $w_{i}$ by

\[
w_{i}=\left(\left|\frac{z^{\ast}_{i}}{c_{l_{1}}(\alpha^{\top}s_{i}\overline{w}_{i})^{l_{1}-1}}\right|\right)^{1/(p+1)}\overline{w}_{i}.
\]

In Algorithm 7, we use $V\widehat{u}_{i}$ to approximate $s_{i}\overline{w}_{i}$, and use the moment estimators $\widehat{Q}_{1}$ and $\widehat{Q}_{2}$ to approximate $Q_{1}$ and $Q_{2}$. The solutions $\widehat{z},\widehat{r}$ of the optimization problems in Line 29 then approximate $z^{*}$ and $r^{*}$, by the robustness of solving linear systems. As a result, the outputs $\widehat{v},\widehat{W}$ approximately recover the true model parameters.
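This third step can be summarized by the following sketch of Lines 29–34 of Algorithm 7 (below); it assumes the estimates $V$, $\{\widehat{u}_{i}\}$, $\widehat{Q}_{1}$, $\widehat{Q}_{2}$ and the activation-dependent constants $c_{l_{1}},c_{l_{2}},l_{1},p$ are given, and the helper name is ours.

```python
import numpy as np

def recover_v_W(V, U_hat, Q1_hat, Q2_hat, alpha, c_l1, c_l2, l1, p):
    """Solve the two least-squares problems of Eq. (6) with V u_hat_i in place of
    s_i w_bar_i, then read off signs and magnitudes as in Lines 31-32."""
    k = U_hat.shape[1]
    A = V @ U_hat                                           # columns approximate s_i w_bar_i
    z_hat, *_ = np.linalg.lstsq(A, Q1_hat, rcond=None)
    B = np.stack([np.outer(U_hat[:, i], U_hat[:, i]).ravel() for i in range(k)], axis=1)
    r_hat, *_ = np.linalg.lstsq(B, Q2_hat.ravel(), rcond=None)
    v_hat = np.sign(r_hat * c_l2)                           # recover v_i in {±1}
    s_hat = np.sign(v_hat * z_hat * c_l1)                   # recover the unknown signs s_i
    scale = np.abs(z_hat / (c_l1 * (alpha @ A) ** (l1 - 1))) ** (1.0 / (p + 1))
    W_hat = (s_hat * scale)[:, None] * A.T                  # rows are the recovered w_hat_i
    return v_hat, W_hat
```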

Algorithm 7 Using the method of moments to recover neural network parameters
1: procedure NeuralNetNoisyRecovery($S=\{(x_{i},y_{i}):i\in[n]\}$)
2:     Choose $\alpha$ to be a random unit vector
3:     Partition $S$ into $S_{1},S_{2},S_{3},S_{4}$ of equal size
4:     $\widehat{P}_{2}\leftarrow\mathbb{E}_{S_{1}}[P_{2}]$, $C\leftarrow 3\|P_{2}\|$, $T\leftarrow\log(1/\epsilon)$
5:     Choose $\widehat{V}_{1}^{(0)},\widehat{V}_{2}^{(0)}\in\mathbb{R}^{d\times k}$ to be random matrices ▷ Estimate the subspace $V$
6:     for $t=1,\dots,T$ do
7:         $\widehat{V}_{1}^{(t)}\leftarrow\mathrm{QR}(C\widehat{V}_{1}^{(t-1)}+\widehat{P}_{2}\widehat{V}_{1}^{(t-1)})$, $\widehat{V}_{2}^{(t)}\leftarrow\mathrm{QR}(C\widehat{V}_{2}^{(t-1)}-\widehat{P}_{2}\widehat{V}_{2}^{(t-1)})$
8:     end for
9:     for $j=1,2$ do
10:        Write $\widehat{V}_{j}^{(T)}=[\widehat{V}_{j,1},\cdots,\widehat{V}_{j,k}]$
11:        for $i\in[k]$ do
12:            $\lambda_{j,i}\leftarrow|\widehat{V}_{j,i}^{\top}\widehat{P}_{2}\widehat{V}_{j,i}|$
13:        end for
14:    end for
15:    $\pi_{1},\pi_{2},k_{1},k_{2}\leftarrow\textsc{TopK}(\lambda,k)$
16:    for $j=1,2$ do
17:        $V_{j}\leftarrow[\widehat{V}_{j,\pi_{j}(1)},\cdots,\widehat{V}_{j,\pi_{j}(k_{j})}]$
18:    end for
19:    $\tilde{V}_{2}\leftarrow\mathrm{QR}((I-V_{1}V_{1}^{\top})V_{2})$, $V\leftarrow[V_{1},\tilde{V}_{2}]$
20:    $\widehat{R}_{3}\leftarrow\mathbb{E}_{S_{2}}[P_{3}(V,V,V)]$, $\{\widehat{u}_{i}\}_{i\in[k]}\leftarrow\textsc{TensorDecomposition}(\widehat{R}_{3})$ ▷ Learn $s_{i}V^{\top}\bar{w}_{i}$
21:    if $M_{1}=M_{3}=0$ then
22:        $l_{1},l_{2}\leftarrow\min\{j\in\{2,4\}:M_{j}\neq 0\}$
23:    else if $M_{2}=M_{4}=0$ then
24:        $l_{1}\leftarrow\min\{j\in\{1,3\}:M_{j}\neq 0\}$, $l_{2}\leftarrow 3$
25:    else
26:        $l_{1}\leftarrow\min\{j\in\{1,3\}:M_{j}\neq 0\}$, $l_{2}\leftarrow\min\{j\in\{2,4\}:M_{j}\neq 0\}$
27:    end if
28:    $\widehat{Q}_{1}\leftarrow\mathbb{E}_{S_{3}}[Q_{1}]$, $\widehat{Q}_{2}\leftarrow\mathbb{E}_{S_{4}}[Q_{2}]$
29:    $\widehat{z}\leftarrow\operatorname{arg\,min}_{z}\|\sum_{i=1}^{k}z_{i}V\widehat{u}_{i}-\widehat{Q}_{1}\|$, $\widehat{r}\leftarrow\operatorname{arg\,min}_{r}\|\sum_{i=1}^{k}r_{i}\widehat{u}_{i}\widehat{u}_{i}^{\top}-\widehat{Q}_{2}\|_{F}$
30:    for $i=1,\dots,k$ do ▷ Learn the parameters $v,W$
31:        $\widehat{v}_{i}\leftarrow\mathrm{sign}(\widehat{r}_{i}c_{l_{2}})$, $\widehat{s}_{i}\leftarrow\mathrm{sign}(\widehat{v}_{i}\widehat{z}_{i}c_{l_{1}})$
32:        $\widehat{w}_{i}\leftarrow\widehat{s}_{i}\left(\left|\frac{\widehat{z}_{i}}{c_{l_{1}}(\alpha^{\top}V\widehat{u}_{i})^{l_{1}-1}}\right|\right)^{1/(p+1)}V\widehat{u}_{i}$
33:    end for
34:    $\widehat{W}\leftarrow[\widehat{w}_{1},\cdots,\widehat{w}_{k}]^{\top}$
35:    Return $(\widehat{v},\widehat{W})$
36: end procedure

Since Algorithm 7 carries out the same computation as [ZSJ+17], the computational complexity is the same. The difference in sample complexity comes from the noise $\xi$ in the model and the truncation of the standard Gaussian. The proof entails bounding the errors in estimating $P_{2}$ in Line 4, $R_{3}$ in Line 20, and $Q_{1},Q_{2}$ in Line 28. In the following, unless otherwise specified, all expectations are with respect to $x\sim\mathcal{N}(0,I_{d})$ and $y=(\langle v,\sigma(Wx)\rangle+\xi)\cdot\mathbbm{1}(\|x\|\leq B)$.

Lemma A.17.

Let $\widehat{P}_{2}$ be computed in Line 4 of Algorithm 7 and let $P_{2}$ be defined in Definition A.16. Suppose $m_{0}=\min_{i\in[k]}\{|m_{j_{2},i}|^{2}(\overline{w}_{i}^{\top}\alpha)^{2(j_{2}-2)}\}$ and

\[
|S|\gtrsim d\cdot\mathrm{poly}(\kappa,\vartheta,\log(d/t))/(\epsilon^{2}m_{0});
\]

then with probability at least $1-t$,

\[
\|P_{2}-\widehat{P}_{2}\|\lesssim\epsilon\sum_{i=1}^{k}|v_{i}m_{j_{2},i}(\bar{w}_{i}^{\top}\alpha)^{j_{2}-2}|+\epsilon.
\]
Proof.

It suffices to bound $\|M_{2}-\widehat{M}_{2}\|$, $\|M_{3}(I,I,\alpha)-\widehat{M}_{3}(I,I,\alpha)\|$, and $\|M_{4}(I,I,\alpha,\alpha)-\widehat{M}_{4}(I,I,\alpha,\alpha)\|$. The main strategy is to bound all relevant moment terms and then invoke Claim D.6.

Specifically, we show that with probability at least $1-t/4$,

\[
\|M_{2}-\widehat{M}_{2}\|\lesssim\epsilon\sum_{i=1}^{k}|v_{i}m_{2,i}|+\epsilon, \tag{7}
\]
\[
\|M_{3}(I,I,\alpha)-\widehat{M}_{3}(I,I,\alpha)\|\lesssim\epsilon\sum_{i=1}^{k}|v_{i}m_{3,i}(\overline{w}_{i}^{\top}\alpha)|+\epsilon, \tag{8}
\]
\[
\|M_{4}(I,I,\alpha,\alpha)-\widehat{M}_{4}(I,I,\alpha,\alpha)\|\lesssim\epsilon\sum_{i=1}^{k}|v_{i}m_{4,i}|(\overline{w}_{i}^{\top}\alpha)^{2}+\epsilon. \tag{9}
\]

Recall that for a sample $(x_{j},y_{j})\in S$, $y_{j}=\sum_{i=1}^{k}v_{i}\sigma(w_{i}^{\top}x_{j})+\xi_{j}$, where $\xi_{j}$ is independent of $x_{j}$. Consider each component $i\in[k]$ and define $C_{i}(x_{j}),B_{i}(x_{j})\in\mathbb{R}^{d\times d}$ as follows:

\[
\begin{aligned}
B_{i}(x_{j})={}&(\sigma(w_{i}^{\top}x_{j})+\xi_{j})\cdot(x_{j}^{\otimes 4}-(x_{j}\otimes x_{j})\tilde{\otimes}I+I\tilde{\otimes}I)(I,I,\alpha,\alpha)\\
={}&(\sigma(w_{i}^{\top}x_{j})+\xi_{j})\cdot\big((x_{j}^{\top}\alpha)^{2}x_{j}^{\otimes 2}-(\alpha^{\top}x_{j})^{2}I-2(\alpha^{\top}x_{j})(x_{j}\alpha^{\top}+\alpha x_{j}^{\top})-x_{j}x_{j}^{\top}+2\alpha\alpha^{\top}+I\big),
\end{aligned}
\]

and $C_{i}(x_{j})=\mathbbm{1}(\|x_{j}\|\leq B)\cdot B_{i}(x_{j})$. From Claim D.5 we have $\mathbb{E}[B_{i}(x_{j})]=m_{4,i}(\overline{w}_{i}^{\top}\alpha)^{2}\overline{w}_{i}\overline{w}_{i}^{\top}$. We calculate

\[
\begin{aligned}
&\big\|\sigma(w_{i}^{\top}x_{j})\cdot(x_{j}^{\otimes 4}-(x_{j}\otimes x_{j})\tilde{\otimes}I+I\tilde{\otimes}I)(I,I,\alpha,\alpha)\big\|\\
\lesssim{}&(|w_{i}^{\top}x_{j}|^{p+1}+|\sigma(0)|)\cdot\big((x_{j}^{\top}\alpha)^{2}\|x_{j}\|^{2}+1+\|x_{j}\|^{2}+(\alpha^{\top}x_{j})^{2}\big)\\
\lesssim{}&\|w_{i}\|^{p+1}\cdot\|x_{j}\|^{p+5}.
\end{aligned}
\]

By Assumption A.5, using Claim D.1 and $B\geq d\cdot\mathrm{poly}\log(d)$, we have

\[
\begin{aligned}
\big\|\mathbb{E}[C_{i}(x_{j})]-m_{4,i}(\overline{w}_{i}^{\top}\alpha)^{2}\overline{w}_{i}\overline{w}_{i}^{\top}\big\|\lesssim{}&\mathbb{E}\big[\mathbbm{1}_{\|x_{j}\|\geq B}\cdot\|w_{i}\|^{p+1}\|x_{j}\|^{p+5}\big]\\
\lesssim{}&(\|w_{i}\|d)^{p+5}\cdot e^{-\Omega(d\log d)}\\
\lesssim{}&\epsilon.
\end{aligned}
\]

Also, $\frac{1}{2}|m_{4,i}|(\overline{w}_{i}^{\top}\alpha)^{2}\leq\|\mathbb{E}[C_{i}(x_{j})]\|\leq 2|m_{4,i}|(\overline{w}_{i}^{\top}\alpha)^{2}$.

For any constant $t\in(0,1)$, we have with probability at least $1-t/4$,

\[
\begin{aligned}
\|C_{i}(x_{j})\|\lesssim{}&(|w_{i}^{\top}x_{j}|^{p+1}+|\sigma(0)|+|\xi_{j}|)\cdot\big((x_{j}^{\top}\alpha)^{2}\|x_{j}\|^{2}+1+\|x_{j}\|^{2}+(\alpha^{\top}x_{j})^{2}\big)\\
\lesssim{}&(\|w_{i}\|^{p+1}+|\sigma(0)|+\vartheta)\cdot d\cdot\mathrm{poly}(\log(d/t)),
\end{aligned}
\]

where the first step comes from Assumption A.5 and the second step comes from Claim D.2 and Claim D.3.

Using Claim D.4, we have

\[
\begin{aligned}
\big\|\mathbb{E}[C_{i}(x_{j})^{2}]\big\|\lesssim{}&\big(\mathbb{E}[(\sigma(w_{i}^{\top}x_{j})+\xi_{j})^{4}]\big)^{1/2}\big(\mathbb{E}[(x_{j}^{\top}\alpha)^{8}]\big)^{1/2}\big(\mathbb{E}[\|x_{j}\|^{4}]\big)^{1/2}\\
\lesssim{}&(\|w_{i}\|^{p+1}+|\sigma(0)|+\vartheta)^{2}d.
\end{aligned}
\]

Furthermore, we have

\[
\max_{\|a\|=1}\big(\mathbb{E}[(a^{\top}C_{i}(x_{j})a)^{2}]\big)^{1/2}\lesssim\big(\mathbb{E}[(\sigma(w_{i}^{\top}x_{j})+\xi_{j})^{4}]\big)^{1/4}\lesssim\|w_{i}\|^{p+1}+|\sigma(0)|+\vartheta.
\]

Then by Claim D.6, with probability at least $1-t$,

\[
\begin{aligned}
&\Big\|m_{4,i}(\overline{w}_{i}^{\top}\alpha)^{2}\overline{w}_{i}\overline{w}_{i}^{\top}-\frac{1}{|S|}\sum_{x_{j}\in S}C_{i}(x_{j})\Big\|\\
\leq{}&\Big\|m_{4,i}(\overline{w}_{i}^{\top}\alpha)^{2}\overline{w}_{i}\overline{w}_{i}^{\top}-\mathbb{E}[C_{i}(x_{j})]\Big\|+\Big\|\mathbb{E}[C_{i}(x_{j})]-\frac{1}{|S|}\sum_{x_{j}\in S}C_{i}(x_{j})\Big\|\\
\lesssim{}&\epsilon|m_{4,i}|(\overline{w}_{i}^{\top}\alpha)^{2}+\epsilon.
\end{aligned}
\]

Summing over all components $i\in[k]$ proves Eq. (9); Eq. (7) and Eq. (8) can be shown similarly. ∎

Lemma A.18.

Let $V\in\mathbb{R}^{d\times k}$ be an orthogonal matrix. Let $\widehat{R}_{3}$ be computed in Line 20 of Algorithm 7 and $R_{3}=P_{3}(V,V,V)$. Suppose

\[
m_{0}=\min_{i\in[k]}\{|m_{j_{3},i}|^{2}(\overline{w}_{i}^{\top}\alpha)^{2(j_{3}-3)}\}
\]

and

\[
|S|\gtrsim d\cdot\mathrm{poly}(\kappa,\vartheta,\log(d/t))/(\epsilon^{2}m_{0});
\]

then with probability at least $1-t$,

\[
\|R_{3}-\widehat{R}_{3}\|\lesssim\epsilon\sum_{i=1}^{k}|v_{i}m_{j_{3},i}(\bar{w}_{i}^{\top}\alpha)^{j_{3}-3}|+\epsilon.
\]
Proof.

From the definition of $R_{3}$, it suffices to bound $\|M_{3}(V,V,V)-\widehat{M}_{3}(V,V,V)\|$ and $\|M_{4}(V,V,V,\alpha)-\widehat{M}_{4}(V,V,V,\alpha)\|$. The proof is similar to the previous one.

Specifically, we show that with probability at least $1-t/4$,

\[
\|M_{3}(V,V,V)-\widehat{M}_{3}(V,V,V)\|\lesssim\epsilon\sum_{i=1}^{k}|v_{i}m_{3,i}|+\epsilon, \tag{10}
\]
\[
\|M_{4}(V,V,V,\alpha)-\widehat{M}_{4}(V,V,V,\alpha)\|\lesssim\epsilon\sum_{i=1}^{k}|v_{i}m_{4,i}(\overline{w}_{i}^{\top}\alpha)|+\epsilon. \tag{11}
\]

Recall that for a sample $(x_{j},y_{j})\in S$, $y_{j}=\sum_{i=1}^{k}v_{i}\sigma(w_{i}^{\top}x_{j})+\xi_{j}$, where $\xi_{j}$ is independent of $x_{j}$. Consider each component $i\in[k]$ and define $T_{i}(x_{j}),S_{i}(x_{j})\in\mathbb{R}^{k\times k\times k}$ by

\[
\begin{aligned}
T_{i}(x_{j})={}&(\sigma(w_{i}^{\top}x_{j})+\xi_{j})\cdot\big(x_{j}^{\top}\alpha\cdot v(x_{j})^{\otimes 3}-(V^{\top}\alpha)\tilde{\otimes}(v(x_{j})\otimes v(x_{j}))-\alpha^{\top}x_{j}\cdot v(x_{j})\tilde{\otimes}I+(V^{\top}\alpha)\tilde{\otimes}I\big),\\
S_{i}(x_{j})={}&\mathbbm{1}(\|x_{j}\|\leq B)\cdot T_{i}(x_{j}),
\end{aligned}
\]

where $v(x)=V^{\top}x$. Flatten $T_{i}(x_{j})$ along the first dimension to obtain $B_{i}(x_{j})\in\mathbb{R}^{k\times k^{2}}$, and flatten $S_{i}(x_{j})$ along the first dimension to obtain $C_{i}(x_{j})\in\mathbb{R}^{k\times k^{2}}$.

From Claim D.7, $\mathbb{E}[B_{i}(x_{j})]=m_{4,i}(\alpha^{\top}\overline{w}_{i})(V^{\top}\overline{w}_{i})\,\mathrm{vec}((V^{\top}\overline{w}_{i})(V^{\top}\overline{w}_{i})^{\top})^{\top}$. Therefore,

\[
\big\|\mathbb{E}[B_{i}(x_{j})]\big\|=|m_{4,i}(\alpha^{\top}\overline{w}_{i})|\cdot\|V^{\top}\overline{w}_{i}\|^{3}.
\]

We calculate

\[
\begin{aligned}
\|\mathbb{E}_{\xi_{j}}[B_{i}(x_{j})]\|\lesssim{}&(|w_{i}^{\top}x_{j}|^{p+1}+|\sigma(0)|)\cdot\big((x_{j}^{\top}\alpha)^{2}\|V^{\top}x_{j}\|^{3}+3\|V^{\top}x_{j}\|^{3}+3|x_{j}^{\top}\alpha|\,\|V^{\top}x_{j}\|\sqrt{k}+3\|V^{\top}\alpha\|\sqrt{k}\big)\\
\lesssim{}&\sqrt{k}\cdot\|w_{i}\|^{p+1}\|x_{j}\|^{p+6}.
\end{aligned}
\]

By Assumption A.5, using Claim D.1 and $B\geq d\cdot\mathrm{poly}\log(d)$,

\[
\begin{aligned}
&\big\|\mathbb{E}[C_{i}(x_{j})]-m_{4,i}(\alpha^{\top}\overline{w}_{i})(V^{\top}\overline{w}_{i})\,\mathrm{vec}((V^{\top}\overline{w}_{i})(V^{\top}\overline{w}_{i})^{\top})^{\top}\big\|\\
\lesssim{}&\mathbb{E}\big[\mathbbm{1}_{\|x_{j}\|\geq B}\cdot\sqrt{k}\,\|w_{i}\|^{p+1}\|x_{j}\|^{p+6}\big]\\
\lesssim{}&\epsilon.
\end{aligned}
\]

For any constant $t\in(0,1)$, we have with probability at least $1-t$,

\[
\begin{aligned}
\|C_{i}(x_{j})\|\lesssim{}&(|w_{i}^{\top}x_{j}|^{p+1}+|\sigma(0)|+|\xi_{j}|)\cdot\big((x_{j}^{\top}\alpha)^{2}\|V^{\top}x_{j}\|^{3}+3\|V^{\top}x_{j}\|^{3}+3|x_{j}^{\top}\alpha|\,\|V^{\top}x_{j}\|\sqrt{k}+3\|V^{\top}\alpha\|\sqrt{k}\big)\\
\lesssim{}&(\|w_{i}\|^{p+1}+|\sigma(0)|+\vartheta)\,k^{3/2}\,\mathrm{poly}(\log(d/t)),
\end{aligned}
\]

where the first step comes from Assumption A.5 and the second step comes from Claim D.2 and Claim D.3.

Using Claim D.4, we have

\[
\begin{aligned}
\big\|\mathbb{E}[C_{i}(x_{j})C_{i}(x_{j})^{\top}]\big\|\lesssim{}&\big(\mathbb{E}[(\sigma(w_{i}^{\top}x_{j})+\xi_{j})^{4}]\big)^{1/2}\big(\mathbb{E}[(\alpha^{\top}x_{j})^{4}]\big)^{1/2}\big(\mathbb{E}[\|V^{\top}x_{j}\|^{6}]\big)^{1/2}\\
\lesssim{}&(\|w_{i}\|^{p+1}+|\sigma(0)|+\vartheta)^{2}k^{3/2}
\end{aligned}
\]

and

\[
\begin{aligned}
&\big\|\mathbb{E}[C_{i}(x_{j})^{\top}C_{i}(x_{j})]\big\|\\
\lesssim{}&\big(\mathbb{E}[(\sigma(w_{i}^{\top}x_{j})+\xi_{j})^{4}]\big)^{1/2}\big(\mathbb{E}[(\alpha^{\top}x_{j})^{4}]\big)^{1/2}\big(\mathbb{E}[\|V^{\top}x_{j}\|^{4}]\big)^{1/2}\cdot\Big(\max_{\|A\|_{F}=1}\mathbb{E}\big[\langle A,(V^{\top}x_{j})(V^{\top}x_{j})^{\top}\rangle^{4}\big]\Big)^{1/2}\\
\lesssim{}&(\|w_{i}\|^{p+1}+|\sigma(0)|+\vartheta)^{2}k^{2}.
\end{aligned}
\]

Furthermore, we have

\[
\begin{aligned}
&\max_{\|a\|=\|b\|=1}\big(\mathbb{E}[(a^{\top}C_{i}(x_{j})b)^{2}]\big)^{1/2}\\
\lesssim{}&\big(\mathbb{E}[(\sigma(w_{i}^{\top}x_{j})+\xi_{j})^{4}]\big)^{1/4}\big(\mathbb{E}[(\alpha^{\top}x_{j})^{4}]\big)^{1/4}\max_{\|a\|=1}\big(\mathbb{E}[(a^{\top}V^{\top}x_{j})^{4}]\big)^{1/2}\cdot\max_{\|A\|_{F}=1}\big(\mathbb{E}\big[\langle A,(V^{\top}x_{j})(V^{\top}x_{j})^{\top}\rangle^{4}\big]\big)^{1/2}\\
\lesssim{}&(\|w_{i}\|^{p+1}+|\sigma(0)|+\vartheta)\,k.
\end{aligned}
\]

Then by Claim D.6, with probability at least $1-t$,

\[
\begin{aligned}
&\Big\|m_{4,i}(\alpha^{\top}\overline{w}_{i})(V^{\top}\overline{w}_{i})\,\mathrm{vec}((V^{\top}\overline{w}_{i})(V^{\top}\overline{w}_{i})^{\top})^{\top}-\frac{1}{|S|}\sum_{x_{j}\in S}C_{i}(x_{j})\Big\|\\
\leq{}&\Big\|m_{4,i}(\alpha^{\top}\overline{w}_{i})(V^{\top}\overline{w}_{i})\,\mathrm{vec}((V^{\top}\overline{w}_{i})(V^{\top}\overline{w}_{i})^{\top})^{\top}-\mathbb{E}[C_{i}(x_{j})]\Big\|+\Big\|\mathbb{E}[C_{i}(x_{j})]-\frac{1}{|S|}\sum_{x_{j}\in S}C_{i}(x_{j})\Big\|\\
\lesssim{}&\epsilon|v_{i}m_{4,i}(\overline{w}_{i}^{\top}\alpha)|+\epsilon.
\end{aligned}
\]

Summing over all neurons $i\in[k]$ proves Eq. (11); Eq. (10) can be shown similarly. ∎

Lemma A.19.

Let $\widehat{Q}_{1}$ and $\widehat{Q}_{2}$ be computed in Line 28 of Algorithm 7, and let $Q_{1}$ and $Q_{2}$ be defined by Eq. (4) and Eq. (5). Suppose

\[
m_{0}=\min_{i\in[k]}\big\{|m_{l_{1},i}|^{2}(\overline{w}_{i}^{\top}\alpha)^{2(l_{1}-1)},\,|m_{l_{2},i}|^{2}(\overline{w}_{i}^{\top}\alpha)^{2(l_{2}-2)}\big\}
\]

and

\[
|S|\gtrsim d\cdot\mathrm{poly}(\kappa,\vartheta,\log(d/t))/(\epsilon^{2}m_{0});
\]

then with probability at least $1-t$,

\[
\begin{aligned}
\|Q_{1}-\widehat{Q}_{1}\|\lesssim{}&\epsilon\sum_{i=1}^{k}|v_{i}m_{l_{1},i}(\bar{w}_{i}^{\top}\alpha)^{l_{1}-1}|+\epsilon,\\
\|Q_{2}-\widehat{Q}_{2}\|\lesssim{}&\epsilon\sum_{i=1}^{k}|v_{i}m_{l_{2},i}(\bar{w}_{i}^{\top}\alpha)^{l_{2}-2}|+\epsilon.
\end{aligned}
\]
Proof.

Recall the expressions of $Q_{1}$ and $Q_{2}$:

\[
\begin{aligned}
Q_{1}&=M_{l_{1}}(I,\underbrace{\alpha,\cdots,\alpha}_{(l_{1}-1)\ \alpha\text{'s}})=\sum_{i=1}^{k}v_{i}c_{l_{1}}\|w_{i}\|^{p+1}(\alpha^{\top}\overline{w}_{i})^{l_{1}-1}\overline{w}_{i},\\
Q_{2}&=M_{l_{2}}(V,V,\underbrace{\alpha,\cdots,\alpha}_{(l_{2}-2)\ \alpha\text{'s}})=\sum_{i=1}^{k}v_{i}c_{l_{2}}\|w_{i}\|^{p+1}(\alpha^{\top}\overline{w}_{i})^{l_{2}-2}(V^{\top}\overline{w}_{i})(V^{\top}\overline{w}_{i})^{\top}.
\end{aligned}
\]

The proof then follows the same arguments as Lemma A.17 and Lemma A.18. ∎

We also use the following Lemmata from [KCL15, ZSJ+17].

Lemma A.20 (Adapted from Theorem 3 of [KCL15]).

Given a tensor $\widehat{T}=\sum_{i=1}^{k}\pi_{i}u_{i}^{\otimes 3}+\epsilon R\in\mathbb{R}^{d\times d\times d}$, assume the incoherence condition $u_{i}^{\top}u_{j}\leq\mu$. Let $L_{0}:=(\frac{50}{1-\mu^{2}})^{2}$ and $L\geq L_{0}\log(15d(k-1)/t)^{2}$. Then there exists an algorithm such that, with probability at least $1-t$, for every $u_{i}$ it returns a $\tilde{u}_{i}$ satisfying

\[
\|\tilde{u}_{i}-u_{i}\|_{2}\leq O\left(\frac{\sqrt{\|\pi\|_{1}\pi_{\max}}}{\pi_{\min}^{2}}\cdot\frac{\|V^{\top}\|_{2}^{2}}{1-\mu^{2}}\cdot(1+C(t))\right)\epsilon+o(\epsilon),
\]

where $C(t):=\log(kd/t)\sqrt{d/L}$ and $V$ is the inverse of the full-rank extension of $(u_{1},\dots,u_{k})$ with unit-norm columns.

Lemma A.21 (Adapted from Lemma E.6 of [ZSJ+17]).

Let $P_{2}$ be defined as in Definition A.16 and let $\widehat{P}_{2}$ be its empirical version computed in Line 4 of Algorithm 7. Let $U\in\mathbb{R}^{d\times k}$ be the orthogonal column span of $W^{\top}\in\mathbb{R}^{d\times k}$. Assume $\|\widehat{P}_{2}-P_{2}\|\leq s_{k}(P_{2})/10$, and let $C$ be a large enough positive number such that $C>2\|P_{2}\|$. Then after $T=O(\log(1/\epsilon))$ iterations, the matrix $V\in\mathbb{R}^{d\times k}$ computed in Algorithm 7 satisfies

\[
\|UU^{\top}-VV^{\top}\|\lesssim\|\widehat{P}_{2}-P_{2}\|/s_{k}(P_{2})+\epsilon,
\]

which implies

\[
\|(I-VV^{\top})w_{i}\|\lesssim(\|\widehat{P}_{2}-P_{2}\|/s_{k}(P_{2})+\epsilon)\|w_{i}\|.
\]
Lemma A.22 (Adapted from Lemma E.13 in [ZSJ+17]).

Let $U\in\mathbb{R}^{d\times k}$ be the orthogonal column span of $W^{\top}$. Let $V\in\mathbb{R}^{d\times k}$ denote an orthogonal matrix satisfying $\|VV^{\top}-UU^{\top}\|\leq\widehat{\delta}_{2}\lesssim 1/(\kappa^{2}\sqrt{k})$. For each $i\in[k]$, let $\widehat{u}_{i}$ denote a vector satisfying $\|s_{i}\widehat{u}_{i}-V^{\top}\overline{w}_{i}\|\leq\widehat{\delta}_{3}\lesssim 1/(\kappa^{2}\sqrt{k})$. Let $Q_{1}$ be defined as in Eq. (4) and let $\widehat{Q}_{1}$ be its empirical version with $\|Q_{1}-\widehat{Q}_{1}\|\leq\widehat{\delta}_{4}\|Q_{1}\|\leq\frac{1}{4}\|Q_{1}\|$.

Let $z^{*}\in\mathbb{R}^{k}$ and $\widehat{z}\in\mathbb{R}^{k}$ be defined as in Eq. (6) and Line 29, respectively. Then

\[
|\widehat{z}_{i}-z_{i}^{*}|\leq\big(\kappa^{4}k^{3/2}(\widehat{\delta}_{2}+\widehat{\delta}_{3})+\kappa^{2}k^{1/2}\widehat{\delta}_{4}\big)\|z^{*}\|_{1}.
\]
Lemma A.23 (Adapted from Lemma E.14 in [ZSJ+17]).

Let $U\in\mathbb{R}^{d\times k}$ be the orthogonal column span of $W^{\top}$, and let $V$ be an orthogonal matrix satisfying $\|VV^{\top}-UU^{\top}\|\leq\widehat{\delta}_{2}\lesssim 1/(\kappa\sqrt{k})$. For each $i\in[k]$, let $\widehat{u}_{i}$ denote a vector satisfying $\|s_{i}\widehat{u}_{i}-V^{\top}\overline{w}_{i}\|\leq\widehat{\delta}_{3}\lesssim 1/(\sqrt{k}\kappa^{3})$.

Let $Q_{2}$ be defined as in Eq. (5) and let $\widehat{Q}_{2}$ be its empirical version with $\|Q_{2}-\widehat{Q}_{2}\|_{F}\leq\widehat{\delta}_{4}\|Q_{2}\|_{F}\leq\frac{1}{4}\|Q_{2}\|_{F}$. Let $r^{*}\in\mathbb{R}^{k}$ and $\widehat{r}\in\mathbb{R}^{k}$ be defined as in Eq. (6) and Line 29, respectively. Then

\[
|\widehat{r}_{i}-r_{i}^{*}|\leq\big(k^{3}\kappa^{8}\widehat{\delta}_{3}+\kappa^{2}k^{2}\widehat{\delta}_{4}\big)\|r^{*}\|.
\]

We are now in a position to prove Theorem A.13.

Proof.

Consider Algorithm 7. First, by Lemma A.21 and Lemma A.17, we have

\displaystyle\|VV^{\top}\overline{w}_{i}-\overline{w}_{i}\|\leq (\|\widehat{P}_{2}-P_{2}\|/s_{k}(P_{2})+\epsilon)
\displaystyle\leq (\mathrm{poly}(k,\kappa)\|\widehat{P}_{2}-P_{2}\|+\epsilon)
\displaystyle\leq \mathrm{poly}(k,\kappa)\epsilon. (12)

Next, combining Lemma A.20 and Lemma A.18, we have

\displaystyle\|V^{\top}\overline{w}_{i}-s_{i}\widehat{u}_{i}\|\leq\mathrm{poly}(k,\kappa)\|\widehat{R}_{3}-R_{3}\|\leq\epsilon\,\mathrm{poly}(k,\kappa). (13)

It thus follows that

\displaystyle\|\overline{w}_{i}-s_{i}V\widehat{u}_{i}\|\leq \|VV^{\top}\overline{w}_{i}-\overline{w}_{i}\|+\|VV^{\top}\overline{w}_{i}-Vs_{i}\widehat{u}_{i}\|
\displaystyle= \|VV^{\top}\overline{w}_{i}-\overline{w}_{i}\|+\|V^{\top}\overline{w}_{i}-s_{i}\widehat{u}_{i}\|
\displaystyle\leq \epsilon\,\mathrm{poly}(k,\kappa), (14)

where the first step applies the triangle inequality and the last step uses Eq. (12) and Eq. (13).

We proceed to bound the errors in $\widehat{r}$ and $\widehat{z}$. We have

\displaystyle|\widehat{r}_{i}-r_{i}^{*}|\lesssim \mathrm{poly}(k,\kappa)(\|Q_{2}-\widehat{Q}_{2}\|+\|s_{i}\widehat{u}_{i}-V^{\top}\overline{w}_{i}\|)\cdot\|r^{*}\|
\displaystyle\lesssim \epsilon\,\mathrm{poly}(k,\kappa)\cdot\|r^{*}\|, (15)

where the first step comes from Lemma A.23 and the second step comes from Lemma A.19 and Eq. (13). Furthermore,

\displaystyle|\widehat{z}_{i}-z_{i}^{*}|\lesssim \mathrm{poly}(k,\kappa)\cdot(\|Q_{1}-\widehat{Q}_{1}\|+\|s_{i}\widehat{u}_{i}-V^{\top}\overline{w}_{i}\|+\|VV^{\top}-UU^{\top}\|)\cdot\|z^{*}\|_{1}
\displaystyle\lesssim \epsilon\,\mathrm{poly}(k,\kappa)\|z^{*}\|_{1}, (16)

where the first step comes from Lemma A.22 and the second step comes from combining Lemma A.19, Lemma A.21, and Eq. (13). Finally, combining Eq. (14), Eq. (15), and Eq. (16), the output in Line 32 satisfies $\|\widehat{w}_{i}-w_{i}\|_{F}\leq\epsilon\,\mathrm{poly}(k,\kappa)\cdot\|w_{i}\|_{F}$. Since the $v_{i}$ take discrete values, they are recovered exactly. ∎

A.5.2 Exact recovery of neural networks from noiseless data

In this section we prove Theorem A.15. Similar to Appendix A.5.1, denote $W=[w_{1},\cdots,w_{k}]^{\top}$ where $w_{i}\in\mathbb{R}^{d}$, and $\widehat{W}=[\widehat{w}_{1},\cdots,\widehat{w}_{k}]^{\top}$. We use $\cal D$ to denote the distribution of $x\sim\mathcal{N}(0,I_{d})$ and $y=\langle v,\sigma(Wx)\rangle$. We define the empirical loss on the explored features and the population loss as follows,

\displaystyle L_{n}(\widehat{W})= \frac{1}{2n}\sum_{(x,y)\in S^{(1)}}\left(\sum_{i=1}^{k}v_{i}\sigma(\widehat{w}_{i}^{\top}x)-y\right)^{2}, (17)
\displaystyle L(\widehat{W})= \frac{1}{2}\mathbb{E}_{\cal D}\left[\left(\sum_{i=1}^{k}v_{i}\sigma(\widehat{w}_{i}^{\top}x)-y\right)^{2}\right]. (18)
Algorithm 8 Using the method of moments and gradient descent to recover neural network parameters
1:procedure NeuralNetRecovery($S=\{(x_{i},y_{i}):i\in[n]\}$)
2:     Let $S^{(1)}\leftarrow\{(x,y)\in S:\|x\|\leq B\}$
3:     Compute $(v,\widehat{W})\leftarrow\textsc{NeuralNetNoisyRecovery}(S^{(1)})$
4:     Find $W^{(1)}$ as the global minimum of $L_{n}(\cdot)$ by gradient descent initialized at $\widehat{W}$, where $L_{n}(\cdot)$ is defined in Eq. (17).
5:     Return $(v,W^{(1)})$
6:end procedure
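For concreteness, the following is a minimal Python/numpy sketch of the refinement step in Line 4 of Algorithm 8: starting from the moment-based initialization $\widehat{W}$, run gradient descent on the empirical loss $L_{n}$ of Eq. (17). The helper names, the constant step size, and passing the activation and its derivative explicitly are our illustrative choices, not the paper's implementation.

import numpy as np

def empirical_loss_grad(W_hat, v, X, y, sigma, dsigma):
    """Gradient and value of L_n(W_hat) = (1/2n) * sum_i (v^T sigma(W_hat x_i) - y_i)^2."""
    n = X.shape[0]
    pre = X @ W_hat.T                    # (n, k) pre-activations w_j^T x_i
    resid = sigma(pre) @ v - y           # (n,)  prediction residuals
    # chain rule: dL/dw_j = (1/n) * sum_i resid_i * v_j * sigma'(w_j^T x_i) * x_i
    grad = ((resid[:, None] * dsigma(pre)) * v[None, :]).T @ X / n
    return grad, 0.5 * np.mean(resid ** 2)

def refine_by_gradient_descent(W_init, v, X, y, sigma, dsigma, lr=0.1, steps=2000):
    # plain constant-step gradient descent on L_n, initialized at the moment estimate
    W_hat = W_init.copy()
    for _ in range(steps):
        grad, _ = empirical_loss_grad(W_hat, v, X, y, sigma, dsigma)
        W_hat -= lr * grad
    return W_hat

# Example usage with ReLU (sigma(z) = max(z, 0)); the paper's smoothness assumptions
# on sigma are stronger, so this choice is purely illustrative.
# relu, drelu = (lambda z: np.maximum(z, 0.0)), (lambda z: (z > 0).astype(float))
# W_rec = refine_by_gradient_descent(W_hat, v, X, y, relu, drelu)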
Definition A.24.

Let $s_{i}$ be the $i$-th singular value of $W$ and $\lambda:=\prod_{i=1}^{k}(s_{i}/s_{k})$. Let $\tau=(3s_{1}/2)^{4p}/\min_{z\in[s_{k}/2,3s_{1}/2]}\{\rho^{2}(z)\}$.

We use the following results adapted from [ZSJ+17]. The only difference is that the rewards are potentially truncated when $\|x\|\geq B$; since $B=d\cdot\mathrm{poly}\log(d)$, the difference from the standard Gaussian can be bounded in the same way as in Appendix A.5.1.

Lemma A.25 (Concentration, adapted from Lemma D.11 in [ZSJ+17]).

Let the sample size satisfy $n\geq\epsilon^{-2}d\tau\,\mathrm{poly}(\log(d/t))$. Then with probability at least $1-t$,

\displaystyle\|\nabla^{2}L_{n}(W)-\nabla^{2}L(W)\|\lesssim ks_{1}^{2p}\epsilon+\mathrm{poly}(B_{W},d)e^{-\Omega(d)}.
Lemma A.26 (Adapted from Lemma D.16 in [ZSJ+17]).

Assume the activation $\sigma(\cdot)$ satisfies Assumption A.8 and Assumption A.7. Then for any $t\in(0,1)$, if $n\geq d\cdot\mathrm{poly}(\log(d/t))$, with probability at least $1-t$, for any $\widehat{W}$ (not necessarily independent of the samples) satisfying $\|W-\widehat{W}\|\leq s_{k}/4$, we have

\displaystyle\|\nabla^{2}L_{n}(\widehat{W})-\nabla^{2}L_{n}(W)\|\leq ks_{1}^{p}\|W-\widehat{W}\|d^{(p+1)/2}.

Now we prove Theorem A.15.

Proof.

The exact recovery consists of first finding the exact $v$ and an approximate $\widehat{W}$ close enough to $W$ by the tensor method (Appendix A.5.1), and then minimizing the empirical loss $L_{n}(\cdot)$. We will prove that $L_{n}(\cdot)$ is locally strongly convex around $W$, so that minimizing it recovers $W$ exactly.

From Lemma D.3 of [ZSJ+17] we know:

\Omega(\rho(s_{k})/\lambda)I\preceq\nabla^{2}L(W)\preceq O(ks_{1}^{2p})I. (19)

Combining Lemma A.25, $d\geq\log(B_{W}/\lambda)$, and $n\geq\frac{k^{2}\lambda^{2}s_{1}^{4p}}{\rho^{2}(s_{k})}d\tau\,\mathrm{poly}(\log(d/t))$, we know that $\nabla^{2}L_{n}(W)$ must be positive definite.

Next we uniformly bound the Lipschitz constant of $\nabla^{2}L_{n}$. By Lemma A.26, there exists a universal constant $c$ such that for all $\widehat{W}$ satisfying $\|W-\widehat{W}\|\leq cks_{1}^{2p}/(ks_{1}^{p}d^{(p+1)/2})=cs_{1}^{p}d^{-(p+1)/2}$, the Hessian $\nabla^{2}L_{n}(\widehat{W})$ remains positive definite. Hence $L_{n}$ has a unique minimizer in this region.

Since $L_{n}(W)=0$, we can recover $W$ by directly minimizing the empirical loss, as long as we initialize at some $\widehat{W}$ in this region; such a $\widehat{W}$ is provided by the tensor method of Appendix A.5.1. We thus complete the proof. ∎

Appendix B Omitted Proofs in Section 4

For the proofs of Theorem 4.2, Example 4.3, and Example 4.4, we refer the readers to [HHK+21].

Lemma B.1.

Consider the polynomial family ${\cal F}_{{\mathcal{V}}}$ of dimension $D$ and assume $n>2D$. For any $E\subseteq\mathbb{R}^{d}$ of positive measure, by drawing $n$ samples $\{x_{i}\}$ i.i.d. from $\mathbb{P}_{x\sim\mathcal{N}(0,I_{d})}(\cdot\mid x\in E)$ and observing the noiseless feedback $y_{i}=f^{*}(x_{i})$, one can almost surely uniquely determine $f^{*}$ by solving the system of equations $y_{i}=f(x_{i})$, $i=1,\dots,n$, for $f\in{\cal F}_{{\mathcal{V}}}$.

Proof.

By Theorem 4.2, there exists a set $N\subseteq\mathbb{R}^{d}\times\cdots\times\mathbb{R}^{d}$ of Lebesgue measure zero such that, if $(x_{1},\cdots,x_{n})\notin N$, one can uniquely determine $f^{*}$ from the observations on the $n$ samples. Therefore, we only need to show that, with probability 1, the sampling procedure returns $(x_{1},\dots,x_{n})\notin N$. This is because

\displaystyle\mathbb{P}((x_{1},\dots,x_{n})\in N) =\mathbb{P}_{x_{i}\sim{\mathcal{N}}(0,I_{d})}((x_{1},\dots,x_{n})\in N\mid x_{1},\dots,x_{n}\in E)
\displaystyle=\frac{\mathbb{P}_{x_{i}\sim{\mathcal{N}}(0,I_{d})}((x_{1},\dots,x_{n})\in N\cap(E\times\dots\times E))}{\mathbb{P}_{x_{i}\sim{\mathcal{N}}(0,I_{d})}((x_{1},\dots,x_{n})\in(E\times\dots\times E))}
\displaystyle=\frac{0}{[\mathbb{P}_{x_{1}\sim{\mathcal{N}}(0,I_{d})}(x_{1}\in E)]^{n}}
\displaystyle=0. ∎

Lemma B.1 above immediately implies the correctness of Algorithms 4 and 5.
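To make the identifiability statement of Lemma B.1 concrete, here is a toy Python/numpy illustration in the simplest, linear-in-coefficients case (univariate polynomials of degree at most 3, so $D=4$): noiseless values on $n>2D$ points drawn from a Gaussian conditioned on a positive-measure set $E=(1,2)$ determine the polynomial exactly. The general family ${\cal F}_{{\mathcal{V}}}$ need not be linear in its parameters, so this sketch only conveys the phenomenon, not Algorithm 4 or 5 themselves.

import numpy as np

rng = np.random.default_rng(0)
coeffs_true = np.array([2.0, -1.0, 0.5, 3.0])            # f*(x) = 2 - x + 0.5 x^2 + 3 x^3

def sample_conditioned(n):
    # rejection sampling from N(0,1) conditioned on x in E = (1, 2)
    xs = []
    while len(xs) < n:
        x = rng.standard_normal()
        if 1.0 < x < 2.0:
            xs.append(x)
    return np.array(xs)

x = sample_conditioned(9)                                 # n = 9 > 2D = 8 samples
y = np.polyval(coeffs_true[::-1], x)                      # noiseless feedback y_i = f*(x_i)
coeffs_hat = np.polynomial.polynomial.polyfit(x, y, 3)    # solve y_i = f(x_i) over the family
assert np.allclose(coeffs_hat, coeffs_true)               # f* is recovered exactly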

Appendix C Omitted Constructions and Proofs in Subsection 4.1

Construction of the Reward Functions

The following construction of the polynomial hard case is adapted from [HHK+21].

Let $d$ be the dimension of the feature space. Let $e_{i}$ denote the $i$-th standard orthonormal basis vector of $\mathbb{R}^{d}$, i.e., $e_{i}$ has a single $1$ at the $i$-th entry and $0$'s elsewhere. Let $p$ denote the highest order of the polynomial, and assume $d\gg p$. We use $\Lambda$ to denote a subset of the $p$-th order multi-indices,

\Lambda=\{(\alpha_{1},\dots,\alpha_{p})\mid 1\leq\alpha_{1}\leq\dots\leq\alpha_{p}\leq d\}.

For $\alpha=(\alpha_{1},\dots,\alpha_{p})\in\Lambda$, denote $M_{\alpha}=e_{\alpha_{1}}\otimes\dots\otimes e_{\alpha_{p}}$ and $x_{\alpha}=e_{\alpha_{1}}+\dots+e_{\alpha_{p}}$.

The model space $\mathcal{M}$ is a subset of rank-1 $p$-th order tensors, defined as $\mathcal{M}=\{M_{\alpha}\mid\alpha\in\Lambda\}$. We define two subsets of the feature space, $\mathcal{F}_{0}$ and $\mathcal{F}$, as $\mathcal{F}_{0}=\{x_{\alpha}\mid\alpha\in\Lambda\}$ and $\mathcal{F}=\mathrm{conv}(\mathcal{F}_{0})$. For $M_{\alpha}\in\mathcal{M}$ and $x\in\mathcal{F}$, define $r(M_{\alpha},x)=\langle M_{\alpha},x^{\otimes p}\rangle=\prod_{i=1}^{p}\langle e_{\alpha_{i}},x\rangle$. We assume that for each level $h$ there is an $M^{(h)}=M_{\alpha^{(h)}}\in\mathcal{M}$, and the noiseless reward is $r_{h}(s,a)=r(M^{(h)},\phi_{h}(s,a))$.

We have the following properties.

Proposition C.1 ([HHK+21]).

For $M_{\alpha}\in\mathcal{M}$ and $x_{\alpha^{\prime}}\in\mathcal{F}_{0}$, we have

r(M_{\alpha},x_{\alpha^{\prime}})=\mathbb{I}_{\{\alpha=\alpha^{\prime}\}}.
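For concreteness, the following small Python/numpy sketch evaluates $r(M_{\alpha},x)=\prod_{i=1}^{p}\langle e_{\alpha_{i}},x\rangle$ and numerically checks the indicator property of Proposition C.1. In this illustration we restrict to multi-indices with distinct entries, and all helper names are ours.

import numpy as np
from itertools import combinations

def reward(alpha, x):
    # r(M_alpha, x) = product of the coordinates of x selected by alpha
    return float(np.prod([x[i] for i in alpha]))

def x_of(alpha, d):
    # the vertex x_alpha = e_{alpha_1} + ... + e_{alpha_p}
    x = np.zeros(d)
    for i in alpha:
        x[i] += 1.0
    return x

d, p = 6, 3
Lambda = list(combinations(range(d), p))       # multi-indices with distinct entries
alpha_star = Lambda[4]
# Proposition C.1: on the vertices x_{alpha'}, the reward is the indicator {alpha' = alpha}.
for alpha_prime in Lambda:
    assert reward(alpha_star, x_of(alpha_prime, d)) == float(alpha_prime == alpha_star)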
Proposition C.2.

For $M_{\alpha}\in\mathcal{M}$, we have

\max_{x\in\mathcal{F}}r(M_{\alpha},x)=1.
Proof of Proposition C.2.

For all $x\in\mathcal{F}$, since $\mathcal{F}=\mathrm{conv}(\mathcal{F}_{0})$, we can write

x=\sum_{\alpha\in\Lambda}p_{\alpha}(e_{\alpha_{1}}+\dots+e_{\alpha_{p}}),

where $\sum_{\alpha\in\Lambda}p_{\alpha}=1$ and $p_{\alpha}\geq 0$. Therefore,

r(M_{\alpha^{\prime}},x)=\prod_{i=1}^{p}\langle e_{\alpha^{\prime}_{i}},x\rangle.

Plugging in the expression for $x$, we have

\displaystyle\langle e_{\alpha^{\prime}_{i}},x\rangle =\sum_{\alpha}p_{\alpha}\langle e_{\alpha^{\prime}_{i}},e_{\alpha_{1}}+\dots+e_{\alpha_{p}}\rangle
\displaystyle=\sum_{\alpha}p_{\alpha}\mathbb{I}_{\{\alpha^{\prime}_{i}\in\alpha\}}
\displaystyle\leq\sum_{\alpha}p_{\alpha}=1.

Therefore,

\displaystyle r(M_{\alpha^{\prime}},x) =\prod_{i=1}^{p}\langle e_{\alpha^{\prime}_{i}},x\rangle
\displaystyle=\Big{(}\sum_{\alpha}p_{\alpha}\mathbb{I}_{\{\alpha^{\prime}_{1}\in\alpha\}}\Big{)}\cdots\Big{(}\sum_{\alpha}p_{\alpha}\mathbb{I}_{\{\alpha^{\prime}_{p}\in\alpha\}}\Big{)}
\displaystyle\leq 1.

Finally, since $r(M_{\alpha^{\prime}},x_{\alpha^{\prime}})=1$, we have $\max_{x\in\mathcal{F}}r(M_{\alpha^{\prime}},x)=1$. ∎

MDP constructions

Consider a family of MDPs with only two states ${\mathcal{S}}=\{S_{\text{good}},S_{\text{bad}}\}$. The action set ${\mathcal{A}}$ is set to be $\mathcal{F}$. Let $f$ be a mapping from $\mathcal{F}$ to $\mathcal{F}_{0}$ that is the identity when restricted to $\mathcal{F}_{0}$. For each level $h\in[H]$, we define the feature map $\phi_{h}:{\mathcal{S}}\times{\mathcal{A}}\to\mathcal{F}$ to be

\phi_{h}(s,a)=\begin{cases}a&\text{ if }s=S_{\text{good}},\\ f(a)&\text{ if }s=S_{\text{bad}}.\end{cases}

Given an unknown sequence of indices $\alpha^{(1)},\dots,\alpha^{(H)}$, the reward function at level $h$ is $r_{h}(s,a)=r(M_{\alpha^{(h)}},\phi_{h}(s,a))$. Specifically, we have

r_{h}(S_{\text{good}},a)=r(M_{\alpha^{(h)}},a),\qquad r_{h}(S_{\text{bad}},a)=r(M_{\alpha^{(h)}},f(a)).

The transition $P_{h}$ is constructed as

P_{h}(S_{\text{bad}}\mid s,a)=1\text{ for all }s\in{\mathcal{S}},a\in{\mathcal{A}}.

Under this construction, in the online setting it is impossible to reach the good state for any $h>1$.
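The construction above can be summarized by the following minimal Python sketch, in the same spirit as the sketch after Proposition C.1. The rounding rule inside $f$ for actions that are not vertices of $\mathcal{F}_{0}$ is left unspecified in the text, so the nearest-vertex choice below is ours, purely for illustration.

import numpy as np

GOOD, BAD = "S_good", "S_bad"

def f(a, vertices):
    # identity on F_0; otherwise map to a nearest vertex of F_0 (our illustrative choice)
    dists = [np.linalg.norm(a - v) for v in vertices]
    return vertices[int(np.argmin(dists))]

def phi(h, s, a, vertices):
    # feature map: the raw action in the good state, its rounded version otherwise
    return a if s == GOOD else f(a, vertices)

def reward_h(h, s, a, alpha_h, vertices):
    # r_h(s, a) = r(M_{alpha^{(h)}}, phi_h(s, a)): product of the selected coordinates
    x = phi(h, s, a, vertices)
    return float(np.prod([x[i] for i in alpha_h]))

def transition(h, s, a):
    # P_h(S_bad | s, a) = 1 for every state-action pair
    return BAD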

The next proposition shows that $Q^{*}_{h}$ is polynomially realizable and falls into the case of Example 4.4.

Proposition C.3.

For all $h\in[H]$, $s\in{\mathcal{S}}$, and $a\in{\mathcal{A}}$, we have $V^{*}_{h}(s)=H-h+1$ and $Q^{*}_{h}(s,a)=r_{h}(s,a)+H-h$. Furthermore, $Q^{*}_{h}(s,a)$, viewed as a function of $\phi_{h}(s,a)$, is a polynomial of the form $q_{h}(U_{h}\phi_{h}(s,a))$ for some degree-$p$ polynomial $q_{h}$ and some $U_{h}\in\mathbb{R}^{p\times d}$.

Proof of Proposition C.3.

First notice that by Proposition C.2, for all $h\in[H]$ and $s\in{\mathcal{S}}$, we have

\max_{a\in{\mathcal{A}}}r_{h}(s,a)=1.

We proceed by backward induction. Suppose we have shown that $V^{*}_{h+1}(s^{\prime})=H-h$ for all $s^{\prime}$ (with the convention $V^{*}_{H+1}\equiv 0$). Then

\displaystyle V^{*}_{h}(s) =\max_{a\in{\mathcal{A}}}Q^{*}_{h}(s,a)
\displaystyle=\max_{a\in{\mathcal{A}}}\{r_{h}(s,a)+\mathbb{E}_{s^{\prime}\sim P_{h}(\cdot\mid s,a)}[V^{*}_{h+1}(s^{\prime})]\}
\displaystyle=1+H-h.

It follows that $Q^{*}_{h}(s,a)=r_{h}(s,a)+H-h$.

Furthermore, we have

\displaystyle Q^{*}_{h}(s,a) =r_{h}(s,a)+H-h
\displaystyle=r(M_{\alpha^{(h)}},\phi_{h}(s,a))+H-h
\displaystyle=\prod_{i=1}^{p}\langle e_{\alpha_{i}^{(h)}},\phi_{h}(s,a)\rangle+H-h
\displaystyle=q_{h}(U_{h}\phi_{h}(s,a)),

where $q_{h}(x_{1},\dots,x_{p})=x_{1}x_{2}\cdots x_{p}+(H-h)$ and $U_{h}\in\mathbb{R}^{p\times d}$ is the matrix with $e_{\alpha_{i}^{(h)}}$ as its $i$-th row. ∎

Theorem C.4.

Under the online RL setting, any algorithm needs to play at least $({d\choose p}-1)=\Omega(d^{p})$ episodes to identify $\alpha^{(2)},\dots,\alpha^{(H)}$ and thus to identify the optimal policy.

Proof of Theorem C.4.

Under the online RL setting, any algorithm enters and remains in $S_{\text{bad}}$ for $h>1$. When $s_{h}=S_{\text{bad}}$, no matter which $a_{h}$ the algorithm chooses, we have $\phi_{h}(s_{h},a_{h})=f(a_{h})\in\mathcal{F}_{0}$. Notice that for any $M_{\alpha^{(h)}}\in\mathcal{M}$ and any $x_{\alpha}\in\mathcal{F}_{0}$, we have $r(M_{\alpha^{(h)}},x_{\alpha})=\mathbb{I}_{\{\alpha=\alpha^{(h)}\}}$, as Proposition C.1 states. Hence, in the worst case we need to play $({d\choose p}-1)$ times at level $h$ to identify $\alpha^{(h)}$. The argument holds for all $h=2,3,\dots,H$. ∎

Theorem C.5.

Under the generative model setting, by querying $2d(p+1)^{p}H=O(dH)$ samples, we can almost surely identify $\alpha^{(1)},\alpha^{(2)},\dots,\alpha^{(H)}$ and thus identify the optimal policy.

Proof of Theorem C.5.

By Proposition C.3, we know that $Q^{*}_{h}(s,a)$, viewed as a function of $\phi_{h}(s,a)$, falls into the case of Example 4.4 with $k=p$.

Next, notice that for all $h\in[H]$, $\{\phi_{h}(s,a)\mid s\in{\mathcal{S}},a\in{\mathcal{A}}\}={\mathcal{F}}$. Although ${\mathcal{F}}$ is not of positive measure, we can in fact evaluate $Q^{*}_{h}$ whenever $\phi_{h}(s,a)$ lies in $\mathrm{conv}({\mathcal{F}},\mathbf{0})$, since the reward is $p$-homogeneous. Specifically, for every feature of the form $c\cdot\phi_{h}(s,a)$, where $0\leq c\leq 1$ and $\phi_{h}(s,a)\in{\mathcal{F}}$, the reward is $c^{p}$ times the reward of $(s,a)$. Therefore, to obtain the reward at $c\cdot\phi_{h}(s,a)$, we only need to query the generative model at $(s,a)$ at level $h$ and then multiply the reward by $c^{p}$.

Notice that $\mathrm{conv}({\mathcal{F}},\mathbf{0})$ has positive Lebesgue measure. By Theorem 4.7, we conclude that only $2d(p+1)^{p}H=O(dH)$ samples are needed to determine the optimal policy almost surely. ∎
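The $p$-homogeneity used in the proof can be sanity-checked numerically; the following standalone Python snippet does so under the same illustrative conventions as the sketch after Proposition C.1 (distinct-entry multi-index, names ours).

import numpy as np

# Sanity check of p-homogeneity: the reward at the scaled feature c * x equals
# c^p times the reward at x, so querying (s, a) and rescaling by c^p suffices.
p = 3
alpha = (0, 2, 4)                                  # a multi-index with distinct entries
x = np.zeros(6); x[list(alpha)] = 1.0              # the vertex x_alpha
reward = lambda a, z: float(np.prod(z[list(a)]))   # r(M_alpha, z): product of selected coords
c = 0.3
assert np.isclose(reward(alpha, c * x), c ** p * reward(alpha, x))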

Appendix D Technical claims

Claim D.1.

Let $\chi^{2}(d)$ denote the $\chi^{2}$-distribution with $d$ degrees of freedom. For any $t>0$ we have

\displaystyle\Pr_{z\sim\chi^{2}(d)}(z\geq d+2t+2\sqrt{dt})\leq e^{-t}.

We use the following facts from [ZSJ+17].

Claim D.2.

Given a fixed vector $z\in\mathbb{R}^{d}$, for any $C\geq 1$ and $n\geq 1$, we have

\displaystyle\Pr_{x\sim{\cal N}(0,I_{d})}[|\langle x,z\rangle|^{2}\leq 5C\|z\|^{2}\log n]\geq 1-1/(nd^{C}).
Claim D.3.

For any $C\geq 1$ and $n\geq 1$, we have

\displaystyle\Pr_{x\sim{\cal N}(0,I_{d})}[\|x\|^{2}\leq 5Cd\log n]\geq 1-1/(nd^{C}).
Claim D.4.

Let $a,b,c\geq 0$ be three constants and let $u,v,w\in\mathbb{R}^{d}$ be three vectors. Then

\displaystyle\mathbb{E}_{x\sim{\cal N}(0,I_{d})}\left[|u^{\top}x|^{a}|v^{\top}x|^{b}|w^{\top}x|^{c}\right]\approx\|u\|^{a}\|v\|^{b}\|w\|^{c}.
Claim D.5.

Let $M_{j}$, $j\in[4]$, be defined as in Definition A.11. Then for each $j\in[4]$, $M_{j}=\sum_{i=1}^{k}v_{i}m_{j,i}\overline{w}_{i}^{\otimes j}$.

Claim D.6.

Let ${\cal B}$ denote a distribution over $\mathbb{R}^{d_{1}\times d_{2}}$ and let $d=d_{1}+d_{2}$. Let $B_{1},B_{2},\cdots,B_{n}$ be i.i.d. random matrices sampled from ${\cal B}$, and let $\overline{B}=\mathbb{E}_{B\sim{\cal B}}[B]$ and $\widehat{B}=\frac{1}{n}\sum_{i=1}^{n}B_{i}$. For parameters $m\geq 0$, $\gamma\in(0,1)$, $\nu>0$, $L>0$, suppose the distribution ${\cal B}$ satisfies the following four properties:

\displaystyle\mathrm{(1)}\quad \Pr_{B\sim{\cal B}}\left[\left\|B\right\|\leq m\right]\geq 1-\gamma;
\displaystyle\mathrm{(2)}\quad \left\|\mathbb{E}_{B\sim{\cal B}}[B]\right\|>0;
\displaystyle\mathrm{(3)}\quad \max\left(\left\|\mathbb{E}_{B\sim{\cal B}}[BB^{\top}]\right\|,\left\|\mathbb{E}_{B\sim{\cal B}}[B^{\top}B]\right\|\right)\leq\nu;
\displaystyle\mathrm{(4)}\quad \max_{\|a\|=\|b\|=1}\left(\mathbb{E}_{B\sim{\cal B}}\left[\left(a^{\top}Bb\right)^{2}\right]\right)^{1/2}\leq L.

Then for any $0<\epsilon<1$ and $t\geq 1$, if

\displaystyle n\geq(18t\log d)\cdot(\nu+\|\overline{B}\|^{2}+m\|\overline{B}\|\epsilon)/(\epsilon^{2}\|\overline{B}\|^{2})\quad\text{ and }\quad\gamma\leq(\epsilon\|\overline{B}\|/(2L))^{2},

then with probability at least $1-1/d^{2t}-n\gamma$,

\|\widehat{B}-\overline{B}\|\leq\epsilon\|\overline{B}\|.
Claim D.7.

Let $P_{2}$ and $P_{3}$ be defined as in Definition A.16. Then

P_{2}=\sum_{i=1}^{k}v_{i}m_{j_{2},i}(\alpha^{\top}\overline{w}_{i})^{j_{2}-2}\overline{w}_{i}^{\otimes 2}

and

P_{3}=\sum_{i=1}^{k}v_{i}m_{j_{3},i}(\alpha^{\top}\overline{w}_{i})^{j_{3}-3}\overline{w}_{i}^{\otimes 3}.