On the Global Convergence of Natural Actor-Critic with Two-layer Neural Network Parametrization
Abstract
Actor-critic algorithms have shown remarkable success in solving state-of-the-art decision-making problems. However, despite their empirical effectiveness, their theoretical underpinnings remain relatively unexplored, especially with neural network parametrization. In this paper, we study a natural actor-critic algorithm that utilizes neural networks to represent the critic. Our aim is to establish sample complexity guarantees for this algorithm, achieving a deeper understanding of its performance characteristics. To achieve that, we propose a Natural Actor-Critic algorithm with 2-Layer critic parametrization (NAC2L). Our approach involves estimating the Q-function in each iteration through a convex optimization problem. We establish that our proposed approach attains a sample complexity of . In contrast, the existing sample complexity results in the literature only hold for a tabular or linear MDP. Our result, on the other hand, holds for countable state spaces and does not require a linear or low-rank structure on the MDP.
1 Introduction
Motivation. The use of neural networks in Actor-critic (AC) algorithms is widespread in various machine learning applications, such as games (DBLP:journals/corr/abs-1708-04782; Haliem2021LearningMG), robotics (morgan2021model), autonomous driving (9351818), ride-sharing (zheng2022stochastic), networking (geng2020multi), and recommender systems (li2020video). AC algorithms sequentially update the estimate of the actor (policy) and the critic (value function) based on the data collected at each iteration, as described in (konda1999actor). An empirical and theoretical improvement of the AC, known as Natural Actor-Critic (NAC), was proposed in (peters2008natural). NAC replaced the stochastic gradient step of the actor with the natural gradient descent step described in (kakade2001natural) based on the theory of (rattray1998natural). Despite its widespread use in state-of-the-art reinforcement learning (RL) implementations, there are no finite-time sample complexity results available in the literature for a setup where neural networks (NNs) are used to represent the critic. Such results provide estimates of the required data to achieve a specified accuracy in a loss function, typically in terms of the difference between the average reward obtained in following the policy obtained through the algorithm being analyzed and the optimal policy.
Challenges. The primary challenge in obtaining sample complexity for a NAC is that the critic loss function with NN parametrization becomes non-convex, making finite-time sample guarantees impossible. Although recent work in (fu2020single) obtains upper bounds on the error in estimating the optimal action-value function in NN critic settings, these upper bounds are asymptotic and depend on the NN parameters. Similarly, other asymptotic sample complexity results, such as those in (kakade2001natural; konda1999actor; bhatnagar2010actor), require i.i.d. sampling of the state action pairs at each iteration of the algorithm. While sample complexity results for linear MDP structures have been derived in (wu2020finite; xu2020improving) (see Related work), the structure allows for a closed-form critic update. The relaxation of the linear MDP assumption requires a critic parametrization to learn the Q-function, limiting the ability to find globally optimal policies due to non-global optimality of critic parameters.
Approach. As the optimal critic update is not available for general MDPs, a parametrization of the critic network is used. To address the challenge of optimizing the critic parameters, we assume a two-layer neural network structure for the critic parametrization. Recent studies have shown that the parameter optimization for two-layer ReLU neural networks can be transformed into an equivalent convex program, which is computationally tractable and solvable exactly (pmlr-v119-pilanci20a). Convex formulations for convolutions and deeper models have also been investigated (sahiner2020vector; sahiner2020convex). In this work, we use these approaches to estimate the parameterized Q-function in the critic. In the policy gradient analysis, the error caused by the lack of knowledge of the exact gradient term is assumed to be upper bounded by a constant (see Related Work for details); the sample complexity of this term is not accounted for because no explicit critic estimation is performed. This paper accounts for the sample complexity of the critic estimation to derive the overall sample complexity of the natural actor-critic algorithm. Having resolved the optimality of the critic parameter update, this paper considers the following question:
Is it possible to obtain global convergence and sample complexity results for a Natural Actor-Critic algorithm with a two-layer neural network parametrization of the critic?
Contributions. In this work, we answer the above question in the affirmative. We summarize our contributions as follows.
• We propose a novel actor-critic algorithm, employing a general parametrization for the actor and a two-layer ReLU neural network parametrization for the critic.
• Building upon the insights presented in (agarwal2020optimality), we leverage the inherent smoothness property of the actor parametrization to derive an upper bound on the estimation error of the optimal value function. This upper bound is expressed in terms of the error incurred in attaining the compatible function approximation term, as elucidated in (DBLP:conf/nips/SuttonMSM99). Our analysis dissects this error into distinct components, each of which is individually bounded from above.
• To address the estimation error arising from the critic function, we leverage the findings of (wang2021hidden; pmlr-v119-pilanci20a), converting the problem of finding the critic parameters into a convex optimization problem. This allows us to establish a finite-time sample complexity bound, a significant improvement over previous results obtained for NAC algorithms. Notably, our approach eliminates the reliance on a linear function approximation for the critic, presenting the first finite-time sample complexity result for NAC without such an approximation.
2 Related Works
Natural Actor-Critic (NAC). Actor-critic methods, first conceptualised in (sutton1988learning), aim to combine the benefits of policy gradient methods and Q-learning-based methods. The policy gradient step in these methods is replaced by the Natural Policy Gradient proposed in (kakade2001natural) to obtain the so-called Natural Actor-Critic in (peters2005natural). Sample complexity results for actor-critic were first obtained for MDPs with finite states and actions in (williams1990mathematical), and more recently in (lan2023policy; zhang2020provably). Finite-time convergence for natural actor-critic under a linear MDP assumption has been obtained in (chen2022finite; khodadadian2021finite; xu2020non), with the best known sample complexity given in (xu2020improving). Finite-time sample complexity results, however, are not available for natural actor-critic setups for general MDPs where neural networks are used to represent the critic. (fu2020single) obtained asymptotic upper bounds on the estimation error for a natural actor-critic algorithm where a neural network is used to represent the critic. The key related works are summarized in Table 1.
Sample Complexity of PG Algorithms. Policy gradient (PG) methods were first introduced in (DBLP:conf/nips/SuttonMSM99), with natural policy gradients introduced in (kakade2001natural). Sample complexity results were obtained in tabular setups in (even2009online) and then improved in (agarwal2020optimality). Under a linear dynamics assumption, convergence guarantees were obtained in (fazel2018global; dean2020sample; bhandari2019global). (zhang2020global) obtains sample complexity guarantees for convergence to second-order stationary point policies. In order to improve the performance of policy gradient methods, techniques such as REINFORCE (williams1992simple), function approximation (sutton1999policy), importance weights (papini2018stochastic), adaptive batch sizes (papini2017adaptive), and Adaptive Trust Region Policy Optimization (shani2020adaptive) have been used. We note that an improved sample complexity for the Natural Policy Gradient approach has been shown in (liu2020improved). However, that work assumes that the error incurred due to the lack of knowledge of the exact gradient term is upper bounded by a constant (Assumption 4.4, (liu2020improved)). Additionally, each estimate of the critic requires, on average, a sample of a certain size, and the obtained estimate is only known to be an unbiased estimate of the critic, with no error bounds provided. The explicit dependence of the constant in (Assumption 4.4, (liu2020improved)) on the number of samples is not considered in their sample complexity results. For the natural actor-critic analysis in this paper, we provide the explicit dependence of the error in estimating the gradient update and incorporate the samples required for the estimation in the sample complexity analysis. We describe this difference in detail in Appendix LABEL:Comp_npg.
Table 1: Comparison with related work on actor-critic and natural actor-critic methods.

References | Actor Parametrization | Critic Parametrization | MDP Assumption | Sample Complexity
(williams1990mathematical) | Tabular | Tabular | Finite | Asymptotic
(borkar1997actor) | Tabular | Tabular | Finite | Asymptotic
(xu2020non) | General | None (Closed Form) | Linear |
(khodadadian2021finite) | General | None (Closed Form) | Linear |
(xu2020improving) | General | None (Closed Form) | Linear |
(fu2020single) | Neural Network | Neural Network | None | Asymptotic
This work | General | Neural Network (2-layer) | None |
3 Problem Setup
We consider a discounted Markov Decision Process (MDP) given by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S}$ is a bounded measurable state space and $\mathcal{A}$ is the finite set of actions. $P: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the probability transition kernel (for a measurable set $\mathcal{X}$, $\Delta(\mathcal{X})$ denotes the set of all probability measures over $\mathcal{X}$), $R$ is the reward kernel on the state action space with $R_{\max}$ being the absolute value of the maximum reward, and $\gamma \in (0,1)$ is the discount factor. A policy $\pi: \mathcal{S} \to \Delta(\mathcal{A})$ maps a state to a probability distribution over the action space. Here, $\Delta(\mathcal{S})$, $\Delta(\mathcal{A})$, and $\Delta([0, R_{\max}])$ denote the sets of all probability distributions over the state space, the action space, and the closed reward interval, respectively. With the above notation, we define the action value function for a given policy $\pi$ as
$$Q^{\pi}(s,a) \;=\; \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}\, r(s_t,a_t) \;\Big|\; s_0 = s,\ a_0 = a\right] \qquad (1)$$
where $s_{t+1} \sim P(\cdot \mid s_t, a_t)$ and $a_t \sim \pi(\cdot \mid s_t)$ for $t \ge 1$, with the rewards drawn from the reward kernel $R$. For a discounted MDP, we define the optimal action value function as $Q^{*}(s,a) = \sup_{\pi} Q^{\pi}(s,a)$. A policy that achieves the optimal action value function is known as the optimal policy and is denoted as $\pi^{*}$. It holds that $\pi^{*}$ is the greedy policy with respect to $Q^{*}$ (10.5555/1512940). Hence, finding $Q^{*}$ is sufficient to obtain the optimal policy. In a similar manner, we can define the value function as $V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot\mid s)}\left[Q^{\pi}(s,a)\right]$, and from the definition in (1) it holds that $V^{\pi}(s) \le \frac{R_{\max}}{1-\gamma}$. Hence, we can define the optimal value function as $V^{*}(s) = V^{\pi^{*}}(s)$. We define $d^{\pi}_{s}$ as the stationary state distribution induced by the policy $\pi$ starting at state $s$, and $\zeta^{\pi}_{s}(s',a') = d^{\pi}_{s}(s')\,\pi(a'\mid s')$ is the corresponding stationary state action distribution. We overload notation to define $d^{\pi}_{\nu}$ and $\zeta^{\pi}_{\nu}$, where $\nu$ is an initial state distribution. We can define the discounted visitation distribution as $d^{\pi}_{\nu,\gamma}(s) = (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\Pr(s_t = s \mid s_0 \sim \nu, \pi)$, where $\Pr(s_t = s \mid s_0 \sim \nu, \pi)$ denotes the probability that the state at time $t$ is $s$ given a starting state distribution $\nu$. Hence, we can write the corresponding discounted state action visitation distribution as $\zeta^{\pi}_{\nu,\gamma}(s,a) = d^{\pi}_{\nu,\gamma}(s)\,\pi(a\mid s)$.
We now describe our Natural Actor-Critic (NAC) algorithm. In a natural policy gradient algorithm (NIPS2001_4b86abe4), the policy is parameterized as $\pi_{\theta}$ with actor parameter $\theta$. The algorithm runs for K total iterations. At each iteration, the policy parameters are updated using a gradient step of the form
(2) |
From the policy gradient theorem (DBLP:conf/nips/SuttonMSM99) we have
(3) | |||||
(4) |
where . From (DBLP:conf/nips/SuttonMSM99), the principle of compatible function approximation implies that we have
(5) | |||||
(6) |
For natural policy gradient algorithms such as those in (agarwal2020optimality) and (liu2020improved), an estimate of the advantage function (and from that, an estimate of the gradient update) is obtained through a sampling procedure that requires additional environment samples, on average, for each sample of the advantage function. For the natural actor-critic setup, we instead maintain a parameterized estimate of the Q-function that is updated at each step and is used to approximate the advantage function.
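For concreteness, a standard form of the compatible function approximation objective and the resulting natural policy gradient update (following kakade2001natural; DBLP:conf/nips/SuttonMSM99), stated here with illustrative notation rather than the exact normalization of Equations (2)-(6), is
$$w_k \in \arg\min_{w}\ \mathbb{E}_{(s,a)\sim \zeta^{\pi_{\theta_k}}}\Big[\big(A^{\pi_{\theta_k}}(s,a) - w^{\top}\nabla_{\theta}\log \pi_{\theta_k}(a\mid s)\big)^{2}\Big], \qquad \theta_{k+1} = \theta_k + \eta\, w_k,$$
where $\zeta^{\pi_{\theta_k}}$ denotes the state action distribution induced by $\pi_{\theta_k}$, $A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)$ is the advantage function, and $\eta$ is the actor step size.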
(xu2020improving) and (wu2020finite) assume that the Q-function can be represented linearly as $Q(s,a) = \phi(s,a)^{\top}\omega$, where $\phi(s,a)$ is a known feature vector and $\omega$ is a parameter that is updated in what is known as the critic step. The policy gradient step is known as the actor update step. In case a neural network is used to represent the Q-function, at each iteration of the algorithm an estimate of its parameters is obtained by solving an optimization problem of the form
(7) |
Due to the non-convex activation functions of a neural network, the optimization in Equation (7) is non-convex, and hence finite-sample guarantees for actor-critic algorithms with a neural network representation of the Q-function are not possible. Thus, in order to perform a finite-time analysis, a linear MDP structure is assumed, which is very restrictive and not practical for large state-space problems. For a natural policy gradient setup, a Monte Carlo estimate is obtained instead, which requires additional samples and causes high variance in the critic estimation.
4 Proposed Approach
4.1 Convex Reformulation of 2 layer Neural Network
We represent the Q-function (critic) using a 2-layer ReLU neural network. A 2-layer ReLU neural network with input $x \in \mathbb{R}^{d}$ is defined as $f(x) = \sum_{i=1}^{m}\sigma\big(x^{\top}u_i\big)\,\alpha_i$, where $m$ is the number of neurons in the neural network, the parameter space is $\Theta = \mathbb{R}^{d\times m}\times\mathbb{R}^{m}$, and $\theta = (U,\alpha)$ is an element of the parameter space, where $u_i$ is the $i$-th column of $U$ and $\alpha_i$ is the $i$-th coefficient of $\alpha$. The function $\sigma$ is the ReLU or restricted linear function defined as $\sigma(t) = \max(t,0)$. In order to obtain parameters for a given data matrix $X \in \mathbb{R}^{n\times d}$ and the corresponding response values $y \in \mathbb{R}^{n}$, we seek the parameters that minimize the squared loss, given by
$$\min_{\theta = (U,\alpha)}\ \Big\|\sum_{i=1}^{m}\sigma(Xu_i)\,\alpha_i \;-\; y\Big\|_{2}^{2} \qquad (8)$$
In (8), the term $\sigma(X u_i)$ is a vector in $\mathbb{R}^{n}$ whose $j$-th entry is $\sigma\big(x_j^{\top}u_i\big)$, where $x_j$ is the $j$-th row of $X$; that is, it is the ReLU function applied to each element of the vector $X u_i$. We note that the optimization in Equation (8) is non-convex in $(U,\alpha)$ due to the presence of the ReLU activation function. In (wang2021hidden), it is shown that this optimization problem has an equivalent convex form, provided that the number of neurons exceeds a certain threshold value. This convex problem is obtained by replacing the ReLU functions in the optimization problem with equivalent diagonal operators. The convex problem is given as
$$\min_{\{v_i, w_i\}}\ \Big\|\sum_{D_i \in \mathcal{D}_X} D_i X (v_i - w_i) \;-\; y\Big\|_{2}^{2} \quad \text{s.t.}\quad (2D_i - I_n)Xv_i \ge 0,\; (2D_i - I_n)Xw_i \ge 0 \ \ \forall i, \qquad (9)$$
where $v_i, w_i \in \mathbb{R}^{d}$ and $\mathcal{D}_X$ is the set of diagonal matrices, which depends on the data set $X$. Except for cases of $X$ being low rank, it is not computationally feasible to obtain the full set $\mathcal{D}_X$. We instead use a sampled subset $\tilde{\mathcal{D}} \subseteq \mathcal{D}_X$ to solve the convex problem in (9), where the variables now correspond to the elements of $\tilde{\mathcal{D}}$. The relevant details of the formulation and the definition of the diagonal matrices are provided in Appendix A. For a set of parameters $\theta = (U,\alpha)$, we denote the neural network represented by these parameters as
$$Q_{\theta}(s,a) \;=\; \sum_{i=1}^{m} \sigma\big(x_{s,a}^{\top}u_i\big)\,\alpha_i, \qquad (10)$$ where $x_{s,a}$ denotes the vector representation of the state action pair $(s,a)$.
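To make the reformulation concrete, below is a minimal numpy sketch of the sampled-gate relaxation used in this approach: gate patterns playing the role of the diagonal matrices are generated from random directions (as described in Appendix A), and the resulting convex least-squares objective is minimized by plain gradient descent. All names and constants are illustrative; the cone constraints of (9) and the cone decomposition step of Algorithm 2 are omitted, and the group penalty stands in for the regularization discussed in Appendix A.

```python
import numpy as np

def sample_gate_patterns(X, num_patterns, rng):
    """Diagonal-matrix surrogates: the diagonal of D_i is 1[X g_i >= 0] for a random g_i."""
    G = rng.standard_normal((X.shape[1], num_patterns))
    return (X @ G >= 0).astype(X.dtype), G            # gates: n x P matrix of 0/1 entries

def fit_sampled_convex_critic(X, y, gates, beta=1e-3, lr=1e-2, steps=3000):
    """Gradient descent on the (unconstrained, sampled-gate) convex surrogate
       0.5/n * || sum_i D_i X u_i - y ||^2 + beta * sum_i ||u_i||_2 ."""
    n, d = X.shape
    U = np.zeros((gates.shape[1], d))                 # one weight vector u_i per gate pattern
    for _ in range(steps):
        pred = np.einsum('np,nd,pd->n', gates, X, U)  # sum_i D_i X u_i
        grad = np.einsum('np,nd,n->pd', gates, X, pred - y) / n
        grad += beta * U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
        U -= lr * grad
    return U

def gated_predict(Xnew, G, U):
    """Evaluate the fitted model on new inputs, reusing the sampled gate directions."""
    gates = (Xnew @ G >= 0).astype(Xnew.dtype)
    return np.einsum('np,nd,pd->n', gates, Xnew, U)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((256, 6))                 # stand-in for (state, action) features
    y = np.maximum(X @ rng.standard_normal(6), 0.0)   # toy regression targets
    gates, G = sample_gate_patterns(X, num_patterns=16, rng=rng)
    U = fit_sampled_convex_critic(X, y, gates)
    print("train mse:", float(np.mean((gated_predict(X, G, U) - y) ** 2)))
```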
4.2 Proposed Natural Actor Critic Algorithm with 2-Layer Critic Parametrization (NAC2L)
Input: Time Horizon K, Updates per time step J, starting state action sampling distribution, Number of convex optimization steps, Actor SGD learning rate
(14) |
We summarize the proposed approach in Algorithm 1. Algorithm 1 has an outer for loop with two inner for loops. At a fixed iteration of the outer for loop and a fixed iteration of the first inner for loop, we obtain a sequence of state action pairs and the corresponding next states and rewards by following the policy estimate available at the start of the iteration. In order to perform the critic update, the state action pairs and the corresponding target values are stored in matrix form and passed to Algorithm 2 as the input and output values, respectively, to solve the following optimization problem.
(15) |
where the first quantity is the estimate of the Q-function at the corresponding iteration of the outer for loop and of the first inner for loop of Algorithm 1, the critic is a neural network defined as in (10), and the last quantity is the number of state action pairs sampled at that iteration of the outer for loop and of the first inner for loop of Algorithm 1. This is done at each iteration of the first inner for loop to perform what is known as a fitted Q-iteration step to obtain the estimate of the critic.
Algorithm 2 first samples a set of diagonal matrices, denoted by $\tilde{\mathcal{D}}$, in line 2 of Algorithm 2. The elements of $\tilde{\mathcal{D}}$ act as the diagonal-matrix replacement of the ReLU function. Algorithm 2 then solves an optimization of the form given in Equation (15) by converting it to an optimization of the form (34). This convex optimization is solved in Algorithm 2 using the projected gradient descent algorithm. After obtaining the optima of this convex program in line 10, we transform them into an estimate of the solution of the optimization given in (15), which is then passed back to Algorithm 1. The procedure is described in detail, along with the relevant definitions, in Appendix A.
The estimate of the natural gradient direction is obtained in the second inner for loop of Algorithm 1, where gradient descent is performed on a loss function of the form given in Equation (6) using the state action pairs sampled in the first inner for loop. Note that we do not have access to the true Q-function that is required for this update; thus, we use the estimate of the Q-function obtained at the end of the first inner for loop. After obtaining our estimate of the minimizer of Equation (6), we update the policy parameters using the stochastic gradient update step. Here, the state action pairs used are the same ones sampled in the first inner for loop; a minimal sketch of this actor step is given below.
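As an illustration of the actor step just described, the following is a minimal numpy sketch that fits a direction w by SGD on the compatible-function-approximation loss and then updates the policy parameters along w. The softmax-over-features policy, the stand-in critic functions `q_hat` and `v_hat`, and all names and constants are illustrative assumptions for the example, not the parametrization or hyperparameters of Algorithm 1.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, s_feat, a):
    """Score function for a softmax policy pi(a|s) proportional to exp(theta[a] . phi(s))."""
    probs = softmax(theta @ s_feat)           # theta: (A, d), s_feat: (d,)
    g = -np.outer(probs, s_feat)
    g[a] += s_feat
    return g.ravel()

def actor_step(theta, states, actions, q_hat, v_hat, eta=0.1, lr=0.05, steps=500, seed=0):
    """Fit w by SGD on the compatible-function-approximation loss
       E[(w . grad log pi(a|s) - A(s,a))^2], then update theta <- theta + eta * w."""
    rng = np.random.default_rng(seed)
    w = np.zeros(theta.size)
    for _ in range(steps):
        i = rng.integers(len(states))                 # reuse a sampled (s, a) pair
        s, a = states[i], actions[i]
        g = grad_log_pi(theta, s, a)
        adv = q_hat(s, a) - v_hat(s)                  # advantage from the critic estimate
        w -= lr * 2.0 * (g @ w - adv) * g             # SGD step on the squared error
    return theta + eta * w.reshape(theta.shape)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    d, num_actions = 4, 3
    theta = np.zeros((num_actions, d))
    states = [rng.standard_normal(d) for _ in range(256)]
    actions = [int(rng.integers(num_actions)) for _ in range(256)]
    # stand-in critic: in NAC2L this would come from the fitted-Q / convex-critic loop above
    q_hat = lambda s, a: 0.1 * (a + 1) * float(s.sum())
    v_hat = lambda s: 0.2 * float(s.sum())
    theta = actor_step(theta, states, actions, q_hat, v_hat)
    print("updated policy parameter norm:", float(np.linalg.norm(theta)))
```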
5 Global Convergence Analysis of NAC2L Algorithm
5.1 Assumptions
In this subsection, we formally describe the assumptions that will be used in the results.
Assumption 1
For any and we have
(16) |
where .
Such assumptions have been utilised in prior policy-gradient-based works such as (agarwal2020optimality; liu2020improved). This assumption is satisfied for the softmax policy parameterization
$$\pi_{\theta}(a \mid s) \;=\; \frac{\exp\big(f_{\theta}(s,a)\big)}{\sum_{a' \in \mathcal{A}}\exp\big(f_{\theta}(s,a')\big)} \qquad (17)$$
where $f_{\theta}$ is a neural network with a smooth activation function (agarwal2020optimality).
Assumption 2
Thus, this quantity is a measure of the error incurred due to taking a sample of diagonal matrices rather than the full set $\mathcal{D}_X$. In practice, setting the number of sampled diagonal matrices to be of the same order of magnitude as the dimension of the data gives a sufficient number of diagonal matrices to obtain a reformulation of the non-convex optimization problem that performs comparably to, or better than, existing gradient descent algorithms; therefore, this error term is included only for theoretical completeness and is negligible in practice. This has been practically demonstrated in (pmlr-v162-mishkin22a; pmlr-v162-bartan22a; pmlr-v162-sahiner22a). Refer to Appendix A for the relevant details.
Assumption 3
We assume that for all functions , there exists a function where such that
(19) |
for any .
This term reflects the error that is incurred due to the inherent lack of expressiveness of the neural network function class. In the analysis of (pmlr-v120-yang20a), this error is assumed to be zero. Explicit upper bounds on this term are given in terms of the neural network parameters in works such as (yarotsky2017error).
Assumption 4
We assume that for all functions and , there exists such that
(20) |
for any .
Assumption 4 is similar to the transfer error assumption in works such as (agarwal2020flambe; liu2020improved). The key difference is that in the referenced works the assumption is stated with respect to the true Q-function, whose estimate is obtained using a noisy Monte Carlo sample; in our case, we use a known parameterised function to obtain our estimate of the Q-function.
Assumption 5
For any $\theta$, denote by $\pi_{\theta}$ the policy corresponding to the parameter $\theta$ and by $\zeta^{\pi_{\theta}}$ the corresponding stationary state action distribution of the induced Markov chain. We assume that there exists a positive integer such that
(21) |
This assumption implies that the induced Markov chain is geometrically mixing. Such an assumption is widely used both in the analysis of stochastic gradient descent with Markovian data, as in (9769873; sun2018markov), and in the finite-time analysis of RL algorithms, as in (wu2020finite; NEURIPS2020_2e1b24a6).
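For reference, one standard way such a geometric mixing condition is stated in the cited works (and the form assumed here for illustration) is that there exist constants $m > 0$ and $\rho \in (0,1)$ such that
$$\sup_{(s,a)\in\mathcal{S}\times\mathcal{A}} d_{TV}\Big(\mathbb{P}\big((s_t,a_t)\in\cdot \mid (s_0,a_0)=(s,a)\big),\ \zeta^{\pi_{\theta}}\Big) \;\le\; m\,\rho^{t} \qquad \text{for all } t \ge 0,$$
where $d_{TV}$ denotes the total variation distance; the constants $m$ and $\rho$ are illustrative and need not match the exact quantities appearing in Equation (21).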
Assumption 6
Let be a probability measure on which is absolutely continuous with respect to the Lebesgue measure. Let be a sequence of policies and suppose that the state action pair has an initial distribution of . Then we assume that for all there exists a constant such that
(22) |
for all , where denotes the Radon Nikodym derivative of the state action distribution with respect to the distribution .
This assumption puts an upper bound on the ratio between the state action distribution obtained by sampling a state action pair from the reference distribution and then following any possible sequence of policies for a finite number of steps, and the reference distribution itself. Similar assumptions have been made in (pmlr-v120-yang20a; JMLR:v17:10-364; farahmand2010error).
5.2 Main Result
Theorem 1
Suppose Assumptions 1-6 hold. Let Algorithm 1 run for K iterations of the outer for loop, and let J be the number of iterations of the first inner loop of Algorithm 1. Let denote the number of state-action pairs sampled and the number of iterations of Algorithm 2 at iteration of the outer for loop and iteration of the first inner for loop of Algorithm 1. Let be the projected gradient descent step size at iteration of Algorithm 2 and the number of diagonal matrices sampled in Algorithm 2. Let be the step size in the projected gradient descent at iteration of the second inner for loop of Algorithm 1. Let be the starting state action distribution at each iteration of Algorithm 1. If we have , and , then we obtain
(23) |
where are constants.
Hence, for , , ,
we have
(24) |
which implies a sample complexity of .
6 Proof Sketch of Theorem 1
The detailed proof of Theorem 1 is given in Appendix E. The difference between our estimated value function and the optimal value function (where the estimated value function corresponds to the policy obtained at the corresponding step of Algorithm 1) is first expressed as a function of the compatible function approximation error, which is then split into different components that are analysed separately. The proof is thus split into two stages. In the first stage, we demonstrate how the difference in value functions is upper bounded as a function of the errors incurred up to the final step. The second stage upper bounds the different error components.
Upper Bounding Error in Separate Error Components: We use the smoothness property assumed in Assumption 1 to obtain a bound on the expectation of the difference between our estimated value function and the optimal value function.
(25) |
where
(26) |
and and is a constant such that , where denotes the iteration of the outer for loop of Algorithm 1. We split the term in (26) into the errors incurred due to the actor and critic step as follows
(27) | |||||
(28) |
Note that the first error term is the difference between the true Q-function corresponding to the policy and our estimate of it. This estimation is carried out in the first inner for loop of Algorithm 1; thus, this is the error incurred in the critic step. The second error term is the error incurred in the estimation of the actor update, which is incurred in the stochastic gradient descent steps in the second inner for loop of Algorithm 1.
Also note that the expectation is with respect to the discounted state action distribution induced by the optimal policy, while the state action samples that we obtain are generated by the policy being followed. Thus, using Assumption 6, we convert the expectation in Equation (28) to an expectation with respect to the stationary state action distribution induced by the policy being followed.
Upper Bounding Error in Critic Step: For each iteration of Algorithm 1, we show that minimizing the critic error term is equivalent to solving the following problem
(29) |
We rely upon the analysis laid out in (farahmand2010error); instead of applying it to the iteration of the value functions, we apply a similar analysis to the action value function to obtain an upper bound for the error incurred in solving the problem in Equation (29) using fitted Q-iteration. We adapt the result for the value function from Lemma 2 of (munos2003error) to the action value function to obtain
(30) |
where the Bellman error incurred at each iteration and the associated operators are defined as in Equations (51) and (53), respectively, and the remaining term denotes the estimate at the final iteration of the first inner for loop and the corresponding iteration of the outer for loop of Algorithm 1.
The first term on the right hand side is called the algorithmic error, which depends on how good our approximation of the Bellman error is. The second term on the right hand side is called the statistical error, which is the error incurred due to the random nature of the system and depends only on the parameters of the MDP as well as the number of iterations of the FQI algorithm. Intuitively, this error depends on how much data is collected at each iteration, how close our solution of the optimization step is to the true solution, and how well our function class can approximate the target of the Bellman update. Building upon this intuition, we split this error into four different components as follows.
(31) | |||||
We use Lemmas 14, 15, 16, and 17 to bound the error terms in Equation (31).
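In the terminology of Appendix B, and under illustrative naming of the per-iteration Bellman error, this decomposition can be summarized as
$$\epsilon_{k,j} \;\le\; \epsilon_{\text{approx}} \;+\; \epsilon_{\text{estim}} \;+\; \epsilon_{\text{sample}} \;+\; \epsilon_{\text{opt}},$$
where the four terms correspond to the approximation, estimation, sampling, and optimization errors defined in Appendix B and are bounded by Lemmas 14, 15, 16, and 17, respectively.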
Upper Bounding Error in Actor Step: Note that we require the minimization of the actor loss term, where the expectation is with respect to the stationary state action distribution corresponding to the policy being followed. However, we do not have samples of state action pairs from this stationary distribution; we only have samples from the Markov chain induced by the policy. We thus refer to the theory in (9769873) and Assumption 5 to upper bound the error incurred.
7 Conclusions
In this paper, we study a natural actor-critic algorithm with a neural network used to represent both the actor and the critic, and we establish sample complexity guarantees for the algorithm. Using the conversion of the two-layer ReLU neural network training problem into a convex problem for estimating the critic, we show that our approach achieves a sample complexity of . This is the first approach to achieve sample complexity guarantees beyond linear MDP assumptions on the critic.
Limitations: Relaxing the stated assumptions is an interesting direction for future work. Further, the results assume a two-layer neural network parametrization for the critic. One could likely use the framework described in (belilovsky2019greedy) to extend the results to a multi-layer setup.
A Convex Reformulation with Two-Layer Neural Networks
For representing the action value function, we use a 2-layer ReLU neural network. In this section, we first lay out the theory behind the convex formulation of the 2-layer ReLU neural network. In the next section, it will be shown how this formulation is utilised in the FQI algorithm.
In order to obtain parameter for a given set of data and the corresponding response values , we desire the parameter that minimizes the squared loss (with a regularization parameter ), given by
(32) |
Here, we have the term which is a vector where is the row of . It is the ReLU function applied to each element of the vector . We note that the optimization in Equation (8) is non-convex in due to the presence of the ReLU activation function. In (wang2021hidden), it is shown that this optimization problem has an equivalent convex form, provided that the number of neurons goes above a certain threshold value. This convex problem is obtained by replacing the ReLU functions in the optimization problem with equivalent diagonal operators. The convex problem is given as
(33) |
where . is the set of diagonal matrices which depend on the data-set . Except for cases of being low rank it is not computationally feasible to obtain the set . We instead use to solve the convex problem
(34) |
where . In order to understand the convex reformulation of the squared loss optimization problem, consider the vector
(35) |
Now for a fixed , different will have different components of that are non zero. For example, if we take the set of all such that only the first element of are non zero (i.e, only and ) and denote it by the set , then we have
(36) |
where is the diagonal matrix with only the first diagonal element equal to and the rest . Similarly, there exist a set of which result in having certain components to be non-zero and the rest zero. For each such combination of zero and non-zero components, we will have a corresponding set of and a corresponding Diagonal matrix . We define the possible set of such diagonal matrices possible for a given matrix X as
(37) |
where represents a matrix given by
(38) |
where if and if Corresponding to each such matrix , there exists a set of given by
(39) |
where is the identity matrix. The number of these matrices is upper bounded by . From (wang2021hidden) the upper bound is where . Also, note that the sets form a partition of the space . Using these definitions, we define the equivalent convex problem to the one in Equation (8) as
(40) |
where , , , note that by definition, for any fixed at-least one of or are zero. If are the optimal solutions to Equation (40), the number of neurons of the original problem in Equation (8) should be greater than the number of elements of , which have at-least one of or non-zero. We denote this value as , with the subscript denoting that this quantity depends upon the data matrix and response .
We convert to optimal values of Equation (8), denoted by , using a function defined as follows
(44) |
where according to (pmlr-v119-pilanci20a) we have , for all where are the elements of . Note that restriction of to is shown to be valid in (pmlr-v162-mishkin22a). For we set .
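As an illustration of this recovery step, the following is a minimal numpy sketch of the standard mapping from an optimal pair $(v_i, w_i)$ of the convex program to two-layer ReLU weights, following the scaling identity used in (pmlr-v119-pilanci20a); the function names are illustrative and all-zero pairs are simply skipped, matching the convention above.

```python
import numpy as np

def recover_relu_weights(V, W, tol=1e-10):
    """Map convex-program solutions (v_i, w_i) to two-layer ReLU parameters (u_i, alpha_i):
    each non-zero v_i gives a neuron with positive output weight, each non-zero w_i a neuron
    with negative output weight, and all-zero pairs are skipped."""
    neurons, out_weights = [], []
    for v, w in zip(V, W):
        for vec, sign in ((v, 1.0), (w, -1.0)):
            nrm = np.linalg.norm(vec)
            if nrm > tol:
                neurons.append(vec / np.sqrt(nrm))       # u_i = v_i / sqrt(||v_i||)
                out_weights.append(sign * np.sqrt(nrm))  # alpha_i = +/- sqrt(||v_i||)
    return np.array(neurons), np.array(out_weights)

def relu_net(x, U, alpha):
    """Evaluate the recovered two-layer ReLU network at a single input x."""
    return float(np.maximum(U @ x, 0.0) @ alpha)
```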
Since is hard to obtain computationally unless is of low rank, we can construct a subset and perform the optimization in Equation (40) by replacing with to get
(45) |
where , , , by definition, for any fixed at-least one of or are zero.
The required condition for to be a sufficient replacement for is as follows. Suppose denote the optimal solutions of Equation (45). Then we require
(46) |
In other words, the number of neurons in the neural network must be greater than the number of indices for which at least one of the two corresponding components is non-zero. Further,
(47) |
In other words, the diagonal matrices induced by the optimal ’s of Equation (8) must be included in our sample of diagonal matrices. This is proved in Theorem 2.1 of (pmlr-v162-mishkin22a).
A computationally efficient method for obtaining the sampled set of diagonal matrices and the optimal values of Equation (8) is laid out in (pmlr-v162-mishkin22a). In this method, we first obtain our sample of diagonal matrices by sampling a fixed number of vectors from a d-dimensional standard multivariate normal distribution, multiplying these vectors with the data matrix, and then forming the diagonal matrices based on which coordinates are positive. Then we solve an optimization similar to the one in Equation (40), but without the constraints that its parameters belong to the corresponding cone sets, as follows.
(48) |
where . In order to satisfy the constraints of the form given in Equation (40), this step is followed by a cone decomposition step. This is implemented through a function . Let be the optimal solution of Equation (48). For each we define a function as
(49) | |||||
Then we obtain the recovered parameters. As before, at least one of the two components in each pair is zero. Note that in practice we do not know if the conditions in Equations (46) and (47) are satisfied for a given sampled set of diagonal matrices. We express this as follows. If the sampled set were the full set of diagonal matrices, then the conditions would hold for all indices. However, since that is not the case, the recovered parameters are an optimal solution of a non-convex optimization problem different from the one in Equation (8). We denote this non-convex optimization problem as the one defined by
(50) |
where or the size of the sampled diagonal matrix set. In order to quantify the error incurred due to taking a subset of , we assume that the expectation of the absolute value of the difference between the neural networks corresponding to the optimal solutions of the non-convex optimizations given in Equations (50) and (8) is upper bounded by a constant depending on the size of . The formal assumption and its justification is given in Assumption 2.
B Error Characterization
Before we define the errors incurred during the actor and critic steps, we define some additional terms as follows
We define the Bellman operator for a policy as follows
(51) |
where Similarly we define the Bellman Optimality Operator as
Similarly we define the Bellman Optimality Operator as
(52) |
Further, operator is defined as
(53) |
which is the one step Markov transition operator for policy for the Markov chain defined on with the transition dynamics given by and . It defines a distribution on the state action space after one transition from the initial state. Similarly, is the -step Markov transition operator following policy at steps . It defines a distribution on the state action space after transitions from the initial state. We have the relation
(54) | ||||
(55) |
We thus define the operator
(56) |
in other words, it is the one step Markov transition operator with respect to the greedy policy of the function on which it is acting, which implies that
(57) |
For any measurable function , we also define
(58) |
for any distribution .
We now characterize the errors which are incurred in the actor and critic steps. We define the stationary state action distribution induced by the policy with the starting state action distribution drawn from a given distribution. For the error incurred in the actor update, we define the related loss function as
Definition 1
For finding the estimate, we re-use the state action pairs sampled in the first inner for loop of Algorithm 1. Note that we have to solve for the loss function where the expectation is with respect to the steady state distribution, while our samples are from a Markov chain which has that steady state distribution. For the error incurred in the critic update, we first define the various possible Q-functions which we can approximate, in decreasing order of accuracy.
For the error components incurred during critic estimation, we start by defining the best possible approximation of the Q-function obtainable from the class of two layer ReLU neural networks, with respect to the expected squared loss from the true ground truth.
Definition 2
For iteration of the outer for loop and iteration of the first inner for loop of Algorithm 1, we define
(60) |
where .
Note that we do not have access to the transition probability kernel, hence we cannot calculate this quantity. To alleviate this, we use the observed next states to estimate the Q-value function. Using this, we define
Definition 3
For iteration of the outer for loop and iteration of the first inner for loop of Algorithm 1, we define
(61) |
where and .
Compared to , in , we are minimizing the expected square loss from target function .
To obtain , we still need to compute the true expected value in Equation 61. However, we still do not know the transition function . To remove this limitation, we use sampling. Consider a set, , of state-action pairs sampled as where . We now define as,
Definition 4
For a given set of state action pairs we define
(62) |
where , and are the observed reward and the observed next state for state action pair respectively.
is the best possible approximation of the Q-value function which minimizes the sample average of the squared loss functions with the given target values, i.e., the empirical loss function. After defining the possible solutions for the Q-values using different loss functions, we define the errors.
We first define approximation error which represents the difference between and its best approximation possible from the class of 2 layer ReLU neural networks. We have
Definition 5 (Approximation Error)
We also define Estimation Error which denotes the error between the best approximation of possible from a 2 layer ReLU neural network and . We demonstrate that these two terms are the same and this error is zero.
Definition 6 (Estimation Error)
We now define Sampling error which denotes the difference between the minimizer of expected loss function and the minimizer of the empirical loss function using samples, . We will use Rademacher complexity results to upper bound this error.
Definition 7 (Sampling Error)
Lastly, we define optimization error which denotes the difference between the minimizer of the empirical square loss function, , and our estimate of this minimizer that is obtained from the projected gradient descent algorithm.
C Supplementary lemmas and Definitions
Here we provide some definitions and results that will be used to prove the lemmas stated in the paper.
Definition 9
For a given set , we define the Rademacher complexity of the set as
(63) |
where is random variable such that , and are the co-ordinates of which is an element of the set
Lemma 10
Consider a set of observed data denoted by , a parameter space , and a loss function where . We define the empirical risk for a set of observed data as and the population risk as , where is a coordinate of sampled from some distribution over .
We define a set of functions denoted by as
(64) |
Given we further define a set as
(65) |
Then, we have
(66) |
If the data is of the form and the loss function is of the form , is lipschitz and , then we have
(67) |
where
(68) |
The detailed proof of the above statement is given in (Rebeschini, 2022) (Algorithmic Foundations of Learning [Lecture Notes], https://www.stats.ox.ac.uk/~rebeschi/teaching/AFoL/20/material/). The upper bound for the first quantity is proved in the aforementioned reference. However, without loss of generality, the same proof holds for the corresponding upper bound on the second quantity; hence that upper bound can also be established.
Lemma 11
Consider two random variables and . Let , , , and denote the expectation with respect to the joint distribution of , the marginal distribution of , the conditional distribution of given , and the conditional distribution of given , respectively. Let denote a bounded measurable function parameterised by some parameter and let be a bounded measurable function of both variables.
Then we have
(69) |
Proof Denote the left hand side of Equation (69) as , then add and subtract to it to get
(70) | ||||
(71) |
Consider the third term on the right hand side of Equation (71)
(72) | ||||
(73) | ||||
(74) | ||||
(75) | ||||
(76) |
Equation (72) is obtained by writing from the law of total expectation. Equation (73) is obtained from (72) as the term is not a function of . Equation (74) is obtained from (73) as because is not a function of hence is constant with respect to the expectation operator .
Thus plugging in value of in Equation (71) we get
(77) |
Note that the second term on the right hand side of Equation (77) does not depend on the variable being optimized; therefore, we can write Equation (77) as
(78) |
Since the right hand side of Equation (78) is not a function of we can replace with to get
(79) |
Lemma 12
Consider an optimization of the form given in Equation (45) with the regularization term denoted by and its convex equivalent denoted by . Then the values of these two loss functions evaluated at and respectively are equal, and thus we have
(80) |
(81) | |||||
(82) |
where , represent the first and second coordinates of respectively.
For any fixed consider the two terms
(83) | |||
(84) |
For a fixed either or is zero. In case both are zero, both of the terms in Equations (83) and (84) are zero as . Assume that for a given . Then we have . Then equations (83), (84) are.
(85) | |||
(86) |
But by definition of we have , therefore Equations (85), (86) are equal. Alternatively if for a given , then , then the terms in (83), (84) become.
(87) | |||
(88) |
By definition of we have , and then the terms in (87) and (88) are equal. Since this is true for all , we have
(89) |
Lemma 13
The function defined in equation (10) is Lipschitz continuous in , where is considered a vector in with the assumption that the set of all possible belong to the set , where is some fixed value.
Proof
First we show that for all we have for all
Note that
(90) |
where with denote the component of respectively.
By construction can only be , or . Therefore if then if both non zero or if one is zero. Therefore . Which leads to a contradiction.
Therefore for all and we also have
(91) |
is defined as
(92) |
From Proposition in (10.5555/3327144.3327299) the function is Lipschitz continuous in , therefore there exist such that
(93) | |||||
(94) |
If we consider a single neuron of , for example , we have such that
(95) |
Now consider Equation (95), but instead of considering the left hand side as a function of , consider it as a function of , where we consider the difference between evaluated at and such that
(96) |
for some .
Similarly, for all other if we change to to be unchanged we have
(97) |
for all if both .
Therefore we obtain
(98) | |||||
(99) | |||||
(100) | |||||
(101) |
This result holds for a fixed . If we take the supremum over on both sides, we get
(102) |
Denoting , we get
(104) | |||||
D Supporting Lemmas
We will now state the key lemmas that will be used for finding the sample complexity of the proposed algorithm.
Lemma 14
Where the expectation is with respect to and
Proof Sketch: We use Assumption 3 and the definition of the variance of a random variable to obtain the required result. The detailed proof is given in Appendix F.1.
Lemma 15
Proof Sketch: We use Lemma 11 in Appendix C and use the definitions of and to prove this result. The detailed proof is given in Appendix F.2.
Lemma 16
Where the expectation is with respect to and
Proof Sketch: First, we note that, for a given iteration of Algorithm 1 and iteration of the first for loop of Algorithm 1, the relevant quantities are defined in Appendix F.3. We use this to get a probabilistic bound on the expected value using Rademacher complexity theory when the samples are drawn from an ergodic Markov chain. The detailed proof is given in Appendix F.3. Note that the presence of the additional term is due to the fact that the state action samples belong to a Markov chain.
Lemma 17
For a given iteration of Algorithm 1 and iteration of the first for loop of Algorithm 1, let the number of steps of the projected gradient descent performed by Algorithm 2, denoted by , and the gradient descent step size satisfy
(107) |
for some constants and . Then the error defined in Definition 8 is upper bounded as
(108) |
Where the expectation is with respect to .
Proof Sketch: We use the number of iterations required to get a bound on the difference between the minimum objective value and the objective value corresponding to the estimated parameter at each iteration. We use the convexity of the objective and the Lipschitz property of the neural network to get a bound on the functions corresponding to the estimated parameters. The detailed proof is given in Appendix F.4.
Lemma 18
For a given iteration of Algorithm 1 and iteration of the first for loop of Algorithm 1, if the number of samples of the state action pairs sampled are denoted by and be the step size in the projected gradient descent at iteration of the second inner for loop of Algorithm 1 which satisfies
(109) |
where is the strong convexity parameter of . Then, it holds that,
(110) |
Proof Sketch: Note that we do not have access to state action samples belonging to the stationary state action distribution corresponding to the policy. We only have access to samples from a Markov chain with the same stationary state action distribution. To account for this, we use the results in (9769873) and obtain the difference between the optimal loss function and the loss function obtained by performing stochastic gradient descent with samples from a Markov chain.
E Proof of Theorem 1
Proof
From Assumption 1, we have
(111) | |||||
(112) |
Thus we have,
(113) | |||||
(114) |
From the definition of KL divergence and from the performance difference lemma from (kakade2002approximately) we have
(115) | ||||
(116) | ||||
(117) | ||||
(118) |
Equation (116) is obtained from Equation (115) using the result in Equation (112). Equation (117) is obtained from Equation (116) using the performance difference lemma from (kakade2002approximately), where the advantage function corresponds to the associated Q-function.
Rearranging, we get
Summing from to and dividing by we get
(120) | ||||
(121) | ||||
(122) |
If we set in Equation (122) we get
(123) |
Now consider the term , we have from Equation (28)
(125) |
where is the estimate of obtained at the iteration of Algorithm 1 and .
We first derive bounds on . From the definition of advantage function we have
(128) |
We write the second term on the right hand side of Equation (128) as where is the measure associated with the state action distribution given by . Then we have
(129) |
where is the measure associated with the state action distribution given by . Using Assumption 6 we have . Thus Equation (129) becomes
(130) |
Since Equation (128) now becomes.
Therefore minimizing is equivalent to minimizing .
In order to prove the bound on we first define some notation, let be two real valued functions on the state action space. The expression implies .
Let denote our estimate of the action value function at iteration of Algorithm 1 and iteration of the first for loop of Algorithm 1, and let denote the estimate at the final iteration, where is the total number of iterations of the first for loop of Algorithm 1. Also, denotes the action value function induced by the policy .
Consider .
(132) |
Thus we get,
(133) | ||||
(134) | ||||
(135) | ||||
(136) |
Right hand side of Equation (133) is obtained by writing . This is because the function is a stationary point with respect to the operator . Equation (134) is obtained from (133) by adding and subtracting . Equation (136) is obtained from (135) as and is the operator with respect to the greedy policy of .
By recursion on , we get,
(137) |
using (from definition of ) and from definition of operator .
From this we obtain
(138) |
Taking absolute value on both sides of Equation (138) we get.
(139) |
For a fixed consider the term using Assumption 6 we write
(140) | |||||
(141) |
Here is the measure associated with the state action distribution given by sampling from and then applying the operator times. is the measure associated with the steady state action distribution given by . Thus Equation (139) becomes
(143) |
We get the second term on the right hand side by noting that . Now splitting as was done in Equation (31) we obtain
(144) |
(145) | |||||
From Equation (LABEL:proof_13_1_6) we get
(146) | |||||
We now derive bounds on . From Theorem 2 of (9769873) we have that
(147) |
Now define the function . From this definition we obtain
(148) |
where is the Lipschitz constant of . Thus we obtain
(149) |
From Assumption 4 we have
(150) |
which gives us
(151) |
(152) | ||||
(153) | ||||
(154) |
F Proof of Supporting Lemmas
F.1 Proof Of Lemma 14
Proof
where .
Since we obtain
(156) |
We have for a random variable , hence . Therefore, replacing with , we get
using the definition of the variance of a random variable we get
(157) |
Therefore we get
(158) |
Since we have
(159) |
F.2 Proof Of Lemma 15
Proof From Lemma 11, we have
(160) |
We label to be the state action pair , function to be and to be the function , where is the two dimensional random variable and , and .
Then the loss function in (69) becomes
(161) |
Therefore, by Lemma 11, we have that the function which minimizes Equation (161) also minimizes
(162) |
But we have from Equation that
(163) |
(164) |
The left hand side of Equation (164) is as defined in Definition 3 and the right hand side is as defined in Definition 2, which gives us
(165) |
F.3 Proof Of Lemma 16
Proof
We define as
Here, , where are sampled from a Markov chain whose stationary distribution is . is as defined in Equation (10) and is the estimate of the Q-function obtained at iteration of the outer for loop and iteration of the first inner for loop of Algorithm 1.
We also define the term
(166) |
where
We denote by the parameters of the neural networks respectively. are defined in Definition 3 and 4 respectively.
We then obtain,
We get the inequality in Equation (LABEL:2_2_2) because is the minimizer of the loss function .
Consider Lemma 10. The loss function can be written as the mean of loss functions of the form where is the square function. and . Thus we have
(170) | |||
Note that the expectation is over all , where , and is the Lipschitz constant for the square function over the state action space . The expectation is with respect to , .
From proposition 11 of (article) we have that
(171) |
Note that Proposition 11 of (article) establishes an upper bound on the Rademacher complexity using Theorem 4 of (article), with the aim of applying it to the Metropolis-Hastings algorithm. For our purposes, we only use the upper bound on the Rademacher complexity established in Proposition 11 of (article).
We use this result as the state action pairs are drawn not from the stationary state of the policy but from a Markov chain with the same steady state distribution. Thus we have
(172) |
The same argument can be applied for to get
(173) |
Then we have
(174) |
Plugging in the definition of in equation (174) and denoting as we get
(175) |
Now for a fixed consider the term defined as.
(176) |
where are drawn from the state action distribution at the step of the Markov chain induced by following the policy .
Now for a fixed consider the term defined as.
(177) |
where are drawn from the steady state action distribution with and . Note here that and are the same function with only the state action pairs being drawn from different distributions.
Using these definitions we obtain
(178) | |||||
(179) |
We obtain Equation (178) by using the identity , where and are two finite state action probability measures and is a bounded measurable function. We have used to represent the total variation distance between the state action measures of the steady state action distribution denoted by and the state action distribution at the step of the Markov chain induced by following the policy . The expectation is with respect to . We obtain Equation (179) from Equation (178) by using Assumption 5 and the fact that and are upper bounded by
From equation (179) we get
(180) |
In Equation (LABEL:2_2_5_3_7), are now drawn from and for all .
We ignore the second term on the right hand side as it is of lower order compared to the first term, which is . Additionally, the expectation in Equation (LABEL:2_2_5_3_7) is with respect to
Since now we have for all , Equation (LABEL:2_2_5_3_7) is equivalent to,
Where the expectation is now over , and . We re-write Equation (LABEL:2_2_5_4) as
(183) |
Where is the state action distribution , , are the measures with respect to , and respectively
Now for the integral in Equation (183) we split the integral into four different integrals. Each integral is over the set of corresponding to the 4 different combinations of signs of .
(184) |
Now note that the first 2 terms are non-negative and the last two terms are non-positive. We then write the first two terms as
We write the last two terms as
(187) | |||
(188) |
Here and are positive constants. Plugging Equations (LABEL:2_2_5_7), (LABEL:2_2_5_8), (187), (188) into Equation (183).
(189) |
which implies
(191) |
Thus we have
(193) |
which implies
(195) |
F.4 Proof Of Lemma 17
Proof For a given iteration of Algorithm 1 and iteration of the first inner for loop, the optimization problem to be solved in Algorithm 2 is the following
(197) |
Here, is the estimate of the Q-function from the preceding iteration, and the state action pairs have been sampled from a distribution over the state action pairs denoted by . Since this is a non-convex optimization problem, we instead solve the equivalent convex problem given by
(199) | |||||
Here, is the matrix of sampled state action pairs at iteration , is the vector of target values at iteration . is the set of diagonal matrices obtained from line 2 of Algorithm 2 and (Note that we are treating as a vector here for notational convenience instead of a matrix as was done in Section 4).
The constraint in Equation (199) ensures that all the coordinates of the vector are upper bounded (since all elements of the diagonal matrices are between 0 and 1). This ensures that the corresponding neural network represented by Equation (10) is also upper bounded. We use projected gradient descent to solve the constrained convex optimization problem, which can be written as
(200) |
From Ang, Andersen (2017), “Continuous Optimization” [Notes], https://angms.science/doc/CVX, we have that if the step size is , then after iterations of the projected gradient descent algorithm we obtain
(201) |
where is the Lipschitz constant of and is the parameter estimate at step .
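For reference, the standard projected gradient descent guarantee being invoked is that, for a convex objective $g$ with $L$-Lipschitz gradient minimized over a closed convex set with step size $1/L$,
$$g(x_{T}) - g(x^{*}) \;\le\; \frac{L\,\|x_{0} - x^{*}\|_{2}^{2}}{2T},$$
where $x^{*}$ is a minimizer over the set, $x_{0}$ is the initialization, and $x_{T}$ is the iterate after $T$ projected gradient steps; the symbols here are generic placeholders rather than the notation of Equation (201).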
Therefore, if the number of iterations of the projected gradient descent algorithm and the step size satisfy
(202) | |||||
(203) |
we have
(204) |
Let , be defined as
(205) | |||
(206) |
where is defined in Equation (49).
Further, we define and as
(207) | |||
(208) |
Since , then by Lemma 12, we have
(209) |
Note that is a constant value. Thus we can always find constant such that
(210) |
(211) |
Therefore if we have
(212) | |||||
(213) |
then we have
(214) |
which according to Equation (211) implies that
(215) |
Dividing Equation (215) by we get
(216) |
Which implies
(217) |
Assuming is small enough such that, from Lemma 13, this implies that there exists an such that
(218) | |||||
(219) |
for all . Equation (219) implies that if
(220) | |||||
(221) |
then we have
(222) |
By the definition in Section E, is our estimate of the Q-function at the corresponding iteration of Algorithm 1, and thus we have , which implies that
(223) |
If we replace by in Equation (222), we get that if
(224) | |||||
(225) |
we have
(226) |
From Assumption 2, we have that
(227) |
where and by definition of in Definition 7, we have that . Therefore if we have
(228) | |||||
(229) |
we have
(231) | |||||
This implies
(232) |