Generative Adversarial Imitation Learning with Neural Networks: Global Optimality and Convergence Rate
Abstract
Generative adversarial imitation learning (GAIL) demonstrates tremendous success in practice, especially when combined with neural networks. Unlike reinforcement learning, GAIL learns both the policy and the reward function from expert (human) demonstrations. Despite its empirical success, it remains unclear whether GAIL with neural networks converges to the globally optimal solution. The major difficulty comes from the nonconvex-nonconcave minimax optimization structure. To bridge the gap between practice and theory, we analyze a gradient-based algorithm with alternating updates and establish its sublinear convergence to the globally optimal solution. To the best of our knowledge, our analysis establishes the global optimality and convergence rate of GAIL with neural networks for the first time.
1 Introduction
The goal of imitation learning (IL) is to learn to perform a task based on expert demonstration (Ho and Ermon, 2016). In contrast to reinforcement learning (RL), the agent only has access to the expert trajectories but not the rewards. The most straightforward approach to IL is behavioral cloning (BC) (Pomerleau, 1991). BC treats IL as a supervised learning problem of predicting the actions based on the states. Despite its simplicity, BC suffers from the compounding errors caused by covariate shift (Ross et al., 2011; Ross and Bagnell, 2010). Another approach to IL is inverse reinforcement learning (IRL) (Russell, 1998; Ng and Russell, 2000; Levine and Koltun, 2012; Finn et al., 2016), which jointly learns the reward function and the corresponding optimal policy. IRL formulates IL as a bilevel optimization problem. Specifically, IRL solves an RL subproblem given a reward function at the inner level and searches for the reward function that makes the expert policy optimal at the outer level. However, IRL is computationally inefficient, as it requires fully solving an RL subproblem at each iteration of the outer level. Moreover, the desired reward function may be nonunique. To address such issues of IRL, Ho and Ermon (2016) propose generative adversarial imitation learning (GAIL), which searches for the optimal policy without fully solving an RL subproblem given a reward function at each iteration. GAIL solves IL via minimax optimization with alternating updates. In particular, GAIL alternates between (i) minimizing the discrepancy in expected cumulative reward between the expert policy and the learned policy and (ii) maximizing such a discrepancy over the reward function class. Such an alternating update scheme mirrors the training of generative adversarial networks (GANs) (Goodfellow et al., 2014; Arjovsky et al., 2017), where the learned policy acts as the generator while the reward function acts as the discriminator.
When combined with neural networks that parameterize the learned policy and the reward function, GAIL achieves significant empirical success in challenging applications, such as natural language processing (Yu et al., 2016), autonomous driving (Kuefler et al., 2017), human behavior modeling (Merel et al., 2017), and robotics (Tai et al., 2018). Despite its empirical success, GAIL with neural networks remains less understood in theory. The major difficulty arises from the following aspects: (i) GAIL involves minimax optimization, while the existing analysis of policy optimization with neural networks (Anthony and Bartlett, 2009; Liu et al., 2019; Bhandari and Russo, 2019; Wang et al., 2019) only focuses on a minimization or maximization problem. (ii) GAIL with neural networks is nonconvex-nonconcave, and therefore, the existing analysis of convex-concave optimization with alternating updates is not applicable (Nesterov, 2013). There is an emerging body of literature (Rafique et al., 2018; Zhang et al., 2019b) that casts nonconvex-nonconcave optimization as bilevel optimization, where the inner level is solved to a high precision as in IRL. However, such analysis is not applicable to GAIL, as it involves alternating updates.
In this paper, we bridge the gap between practice and theory by establishing the global optimality and convergence of GAIL with neural networks. Specifically, we parameterize the learned policy and the reward function with two-layer neural networks and consider solving GAIL by alternately updating the learned policy via a step of natural policy gradient (Kakade, 2002; Peters and Schaal, 2008) and the reward function via a step of gradient ascent. In particular, we parameterize the state-action value function (also known as the Q-function) with a two-layer neural network and apply a variant of the temporal difference algorithm (Sutton and Barto, 2018) to solve the policy evaluation subproblem in natural policy gradient. We prove that the learned policy $\pi$ converges to the expert policy $\pi_{\rm E}$ at a sublinear rate in the $\mathcal{F}$-distance (Chen et al., 2020), which is defined as
$D_{\mathcal{F}}(\pi_{\rm E}, \pi) = \max_{r \in \mathcal{F}} \; J(\pi_{\rm E}; r) - J(\pi; r).$
Here $J(\pi; r)$ is the expected cumulative reward of the policy $\pi$ given the reward function $r$, and $\mathcal{F}$ is the reward function class. The core of our analysis is constructing a potential function that tracks the $\mathcal{F}$-distance. Such a rate of convergence implies that the learned policy (approximately) outperforms the expert policy given any reward function within a finite number of iterations. In other words, the learned policy is globally optimal. To the best of our knowledge, our analysis establishes the global optimality and convergence of GAIL with neural networks for the first time. It is worth mentioning that our analysis is straightforwardly applicable to linear and tabular settings, which, however, are not our focus.
Related works. Our work extends an emerging body of literature on RL with neural networks (Xu et al., 2019a; Zhang et al., 2019a; Bhandari and Russo, 2019; Liu et al., 2019; Wang et al., 2019; Agarwal et al., 2019) to IL. This line of research analyzes the global optimality and convergence of policy gradient for solving RL, which is a minimization or maximization problem. In contrast, we analyze GAIL, which is a minimax optimization problem.
Our work is also related to the analysis of apprenticeship learning (Syed et al., 2008) and GAIL (Cai et al., 2019a; Chen et al., 2020). Syed et al. (2008) analyze the convergence and generalization of apprenticeship learning. They assume the state space to be finite, and thus, do not require function approximation for the policy and the reward function. In contrast, we assume the state space to be infinite and employ function approximation based on neural networks. Cai et al. (2019a) study the global optimality and convergence of GAIL in the setting of linear-quadratic regulators. In contrast, our analysis handles general MDPs without restrictive assumptions on the transition kernel and the reward function. Chen et al. (2020) study the convergence and generalization of GAIL in the setting of general MDPs. However, they only establish the convergence to a stationary point. In contrast, we establish the global optimality of GAIL.
Notations. Let $[n] = \{1, \ldots, n\}$ for any integer $n \geq 1$. Also, let $N(\mu, \Sigma)$ be the Gaussian distribution with mean $\mu$ and covariance $\Sigma$. We denote by $\mathscr{P}(\mathcal{X})$ the set of all probability measures over the space $\mathcal{X}$. For a function $f: \mathcal{X} \to \mathbb{R}$, a constant $p \geq 1$, and a probability measure $\mu \in \mathscr{P}(\mathcal{X})$, we denote by $\|f\|_{p, \mu} = (\int_{\mathcal{X}} |f(x)|^p \, {\rm d}\mu(x))^{1/p}$ the $L_p(\mu)$-norm of the function $f$. For any two functions $f, g: \mathcal{X} \to \mathbb{R}$, we denote by $\langle f, g \rangle_{\mu} = \int_{\mathcal{X}} f(x) g(x) \, {\rm d}\mu(x)$ the inner product on the space $L_2(\mu)$.
2 Background
In this section, we introduce reinforcement learning (RL) and generative adversarial imitation learning (GAIL).
2.1 Reinforcement Learning
We consider a Markov decision process (MDP) $(\mathcal{S}, \mathcal{A}, r, P, \zeta, \gamma)$. Here $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, which is assumed to be finite throughout this paper, $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, $P$ is the transition kernel, $\zeta$ is the initial state distribution, and $\gamma \in (0, 1)$ is the discount factor. Without loss of generality, we assume that $\mathcal{S}$ is compact and that $\|(s, a)\|_2 \leq 1$ for any $(s, a) \in \mathcal{S} \times \mathcal{A}$, where we view the state-action pair $(s, a)$ as a vector in $\mathbb{R}^d$. An agent following a policy $\pi$ interacts with the environment in the following manner. At the state $s_t$, the agent takes the action $a_t$ with probability $\pi(a_t \,|\, s_t)$ and receives the reward $r(s_t, a_t)$. The environment then transits into the next state $s_{t+1}$ with probability $P(s_{t+1} \,|\, s_t, a_t)$. Given a policy $\pi$ and a reward function $r$, we define the state-action value function as follows,
$Q^{\pi}_{r}(s, a) = \mathbb{E}\bigl[\textstyle\sum_{t=0}^{\infty} \gamma^t \cdot r(s_t, a_t) \,\big|\, s_0 = s, a_0 = a\bigr]. \qquad (2.1)$
Here the expectation is taken with respect to $a_t \sim \pi(\cdot \,|\, s_t)$ and $s_{t+1} \sim P(\cdot \,|\, s_t, a_t)$. Correspondingly, we define the state value function and the advantage function as follows,
$V^{\pi}_{r}(s) = \mathbb{E}_{a \sim \pi(\cdot \,|\, s)}\bigl[Q^{\pi}_{r}(s, a)\bigr], \qquad A^{\pi}_{r}(s, a) = Q^{\pi}_{r}(s, a) - V^{\pi}_{r}(s).$
The goal of RL is to maximize the following expected cumulative reward,
$J(\pi; r) = \mathbb{E}_{s \sim \zeta}\bigl[V^{\pi}_{r}(s)\bigr]. \qquad (2.2)$
The policy $\pi$ induces a state visitation measure $\varrho_{\pi}$ and a state-action visitation measure $\nu_{\pi}$, which take the forms of
$\varrho_{\pi}(s) = (1 - \gamma) \cdot \textstyle\sum_{t=0}^{\infty} \gamma^t \cdot \mathbb{P}(s_t = s), \qquad \nu_{\pi}(s, a) = \varrho_{\pi}(s) \cdot \pi(a \,|\, s). \qquad (2.3)$
It then holds that $J(\pi; r) = (1 - \gamma)^{-1} \cdot \mathbb{E}_{(s, a) \sim \nu_{\pi}}[r(s, a)]$. Meanwhile, we assume that the policy $\pi$ induces a stationary state distribution $\varsigma_{\pi}$ over $\mathcal{S}$, which satisfies that
$\varsigma_{\pi}(s') = \int_{\mathcal{S}} \textstyle\sum_{a \in \mathcal{A}} P(s' \,|\, s, a) \cdot \pi(a \,|\, s) \cdot \varsigma_{\pi}(s) \, {\rm d}s.$
We denote by $\sigma_{\pi}(s, a) = \varsigma_{\pi}(s) \cdot \pi(a \,|\, s)$ the state-action stationary distribution over $\mathcal{S} \times \mathcal{A}$.
2.2 Generative Adversarial Imitation Learning
The goal of imitation learning (IL) is to learn a policy that outperforms the expert policy $\pi_{\rm E}$ based on the trajectory of $\pi_{\rm E}$. We denote by $\nu_{\rm E}$ and $\varrho_{\rm E}$ the state-action and state visitation measures induced by the expert policy, respectively, and assume that the expert trajectory is drawn from $\nu_{\rm E}$. To this end, we parameterize the policy and the reward function by $\pi_{\theta}$ for the policy parameter $\theta$ and $r_{\beta}$ for the reward parameter $\beta$, respectively, and solve the following minimax optimization problem known as GAIL (Ho and Ermon, 2016),
$\min_{\theta} \max_{\beta} \; L(\theta, \beta) = J(\pi_{\rm E}; r_{\beta}) - J(\pi_{\theta}; r_{\beta}) - \lambda \cdot \psi(\beta). \qquad (2.4)$
Here $J$ is the expected cumulative reward defined in (2.2), $\psi$ is the regularizer, and $\lambda > 0$ is the regularization parameter. Given a reward function class $\mathcal{F}$, we define the $\mathcal{F}$-distance between two policies $\pi_1$ and $\pi_2$ as follows,
$D_{\mathcal{F}}(\pi_1, \pi_2) = \max_{r \in \mathcal{F}} \; J(\pi_1; r) - J(\pi_2; r). \qquad (2.5)$
When $\mathcal{F}$ is the class of $1$-Lipschitz functions, $D_{\mathcal{F}}(\pi_1, \pi_2)$ coincides, up to a scaling factor, with the Wasserstein-$1$ metric between the state-action visitation measures induced by $\pi_1$ and $\pi_2$. However, $D_{\mathcal{F}}$ is not a metric in general. When $D_{\mathcal{F}}(\pi_{\rm E}, \pi) \leq 0$, the policy $\pi$ outperforms the expert policy $\pi_{\rm E}$ for any reward function $r \in \mathcal{F}$. Such a notion of $\mathcal{F}$-distance is used in Chen et al. (2020). We denote by $\mathcal{F}_{\mathcal{B}}$ the reward function class parameterized with the reward parameter $\beta$. Hence, the optimization problem in (2.4) minimizes the $\mathcal{F}_{\mathcal{B}}$-distance (up to the regularizer $\lambda \cdot \psi$), which searches for a policy that (approximately) outperforms the expert policy given any reward function in $\mathcal{F}_{\mathcal{B}}$.
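To make the objective concrete, the following sketch (our own illustration, not part of the paper; the reward function, batches, and dimensions are placeholders) estimates the discrepancy $J(\pi_1; r) - J(\pi_2; r)$ for a fixed reward function from state-action samples, using the identity $J(\pi; r) = (1 - \gamma)^{-1} \cdot \mathbb{E}_{(s, a) \sim \nu_{\pi}}[r(s, a)]$. Maximizing such an estimate over $r \in \mathcal{F}$ gives a plug-in estimate of $D_{\mathcal{F}}(\pi_1, \pi_2)$.

import numpy as np

def reward_discrepancy(reward_fn, samples_1, samples_2, gamma):
    """Monte Carlo estimate of J(pi_1; r) - J(pi_2; r).

    reward_fn : maps an (n, d) array of state-action vectors to an (n,) array of rewards
    samples_1 : (n1, d) state-action pairs drawn from the visitation measure of pi_1
    samples_2 : (n2, d) state-action pairs drawn from the visitation measure of pi_2
    """
    j1 = reward_fn(samples_1).mean() / (1.0 - gamma)
    j2 = reward_fn(samples_2).mean() / (1.0 - gamma)
    return j1 - j2

# Toy usage with a hypothetical linear reward and synthetic "visitation" samples.
rng = np.random.default_rng(0)
linear_reward = lambda x: 0.1 * x.sum(axis=1)
print(reward_discrepancy(linear_reward,
                         rng.normal(loc=0.5, size=(1000, 4)),
                         rng.normal(loc=0.0, size=(1000, 4)),
                         gamma=0.9))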
3 Algorithm
In this section, we introduce an algorithm with alternating updates for GAIL with neural networks, which employs natural policy gradient (NPG) to update the policy $\pi_{\theta}$ and gradient ascent to update the reward function $r_{\beta}$.
3.1 Parameterization with Neural Networks
We define a two-layer neural network with rectified linear units (ReLU) as follows,
$f\bigl((s, a); W\bigr) = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} b_i \cdot {\rm ReLU}\bigl([W]_i^\top (s, a)\bigr), \quad \text{where } {\rm ReLU}(x) = \max\{x, 0\}. \qquad (3.1)$
Here $m$ is the width of the neural network, $b = (b_1, \ldots, b_m)^\top \in \mathbb{R}^m$ and $W = ([W]_1^\top, \ldots, [W]_m^\top)^\top \in \mathbb{R}^{md}$ are the parameters, and $\phi_W(s, a)$ is called the feature vector in the sequel, where
$\phi_W(s, a) = \frac{1}{\sqrt{m}} \Bigl(b_1 \cdot \mathbb{1}\bigl\{[W]_1^\top (s, a) > 0\bigr\} \cdot (s, a)^\top, \ldots, b_m \cdot \mathbb{1}\bigl\{[W]_m^\top (s, a) > 0\bigr\} \cdot (s, a)^\top\Bigr)^\top \in \mathbb{R}^{md}. \qquad (3.2)$
It then holds that $f((s, a); W) = \phi_W(s, a)^\top W$. Note that the feature vector $\phi_W(s, a)$ depends on the parameters $b$ and $W$. We consider the following random initialization,
$b_i \sim {\rm Unif}\bigl(\{-1, 1\}\bigr), \quad [W(0)]_i \sim N(0, I_d / d), \quad \forall i \in [m]. \qquad (3.3)$
Throughout the training process, we keep the parameter $b$ fixed while updating $W$. For notational simplicity, we write $f_W(s, a) = f((s, a); W)$ and denote by $\phi_W$ the corresponding feature vector in the sequel. We denote by $\mathbb{E}_{\rm init}$ the expectation taken with respect to the random initialization in (3.3). For an absolute constant $B > 0$, we define the parameter domain as
$\mathcal{S}_B = \bigl\{W \in \mathbb{R}^{md} : \|W - W(0)\|_2 \leq B\bigr\}, \qquad (3.4)$
which is the ball centered at the initialization $W(0)$ with the domain radius $B$.
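For concreteness, the following sketch (our own illustration; the width $m$, input dimension $d$, and radius $B$ are placeholders) implements the two-layer network in (3.1), the feature vector in (3.2), the random initialization in (3.3), and the projection onto the parameter domain in (3.4).

import numpy as np

def init_params(d, m, seed=0):
    """Random initialization (3.3): b_i ~ Unif({-1, 1}) (kept fixed) and [W(0)]_i ~ N(0, I_d / d)."""
    rng = np.random.default_rng(seed)
    b = rng.choice([-1.0, 1.0], size=m)
    W0 = rng.normal(scale=1.0 / np.sqrt(d), size=(m, d))
    return b, W0

def f_value(x, W, b):
    """Two-layer ReLU network (3.1): f(x; W) = (1 / sqrt(m)) * sum_i b_i * relu([W]_i^T x)."""
    return (b * np.maximum(W @ x, 0.0)).sum() / np.sqrt(len(b))

def feature(x, W, b):
    """Feature vector (3.2), arranged as an (m, d) array; it satisfies f(x; W) = <feature(x, W, b), W>."""
    active = (W @ x > 0.0).astype(float)
    return (b * active)[:, None] * x[None, :] / np.sqrt(len(b))

def project(W, W0, B):
    """Projection onto the parameter domain (3.4), the ball {W : ||W - W(0)||_2 <= B}."""
    diff = W - W0
    norm = np.linalg.norm(diff)
    return W if norm <= B else W0 + diff * (B / norm)

b, W0 = init_params(d=4, m=64)
x = np.full(4, 0.5)
print(np.isclose(f_value(x, W0, b), (feature(x, W0, b) * W0).sum()))  # checks f = <phi_W(x), W>
print(np.linalg.norm(project(W0 + 1.0, W0, B=0.5) - W0))              # projected back to radius 0.5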
In the sequel, we consider the following energy-based policy,
$\pi_{\theta}(a \,|\, s) = \frac{\exp\bigl(\tau \cdot f((s, a); \theta)\bigr)}{\sum_{a' \in \mathcal{A}} \exp\bigl(\tau \cdot f((s, a'); \theta)\bigr)}, \qquad (3.5)$
where $\tau > 0$ is the inverse temperature parameter and $f(\cdot; \theta)$ is the energy function parameterized by the neural network defined in (3.1) with $W = \theta$. In the sequel, we call $\theta$ the policy parameter. Meanwhile, we parameterize the reward function as follows,
$r_{\beta}(s, a) = (1 - \gamma) \cdot f\bigl((s, a); \beta\bigr), \qquad (3.6)$
where $f(\cdot; \beta)$ is the neural network defined in (3.1) with $W = \beta$ and $\gamma$ is the discount factor. Here we use the scaling parameter $1 - \gamma$ to ensure that for any policy the state-action value function defined in (2.1) is well approximated by the neural network defined in (3.1). In the sequel, we call $\beta$ the reward parameter and define the reward function class as
$\mathcal{F}_{\mathcal{B}} = \bigl\{r_{\beta} : \beta \in \mathcal{S}_{B_{\beta}}\bigr\},$
where $\mathcal{S}_{B_{\beta}}$ is the parameter domain of $\beta$ defined in (3.4) with domain radius $B_{\beta}$. To facilitate algorithm design, we establish the following proposition, which calculates the explicit expressions of the gradients $\nabla_{\theta} L(\theta, \beta)$, $\nabla_{\beta} L(\theta, \beta)$ and the Fisher information $F(\theta)$. Recall that the Fisher information is defined as
$F(\theta) = \mathbb{E}_{(s, a) \sim \nu_{\pi_{\theta}}}\Bigl[\nabla_{\theta} \log \pi_{\theta}(a \,|\, s) \cdot \bigl(\nabla_{\theta} \log \pi_{\theta}(a \,|\, s)\bigr)^\top\Bigr]. \qquad (3.7)$
Proposition 3.1 (Gradients and Fisher Information).
Proof.
See §C.1 for a detailed proof. ∎
Note that the expression of the policy gradient in (3.10) of Proposition 3.1 involves the state-action value function $Q^{\pi_{\theta}}_{r_{\beta}}$. To this end, we estimate the state-action value function by $Q_{\omega}$, which is parameterized as follows,
$Q_{\omega}(s, a) = f\bigl((s, a); \omega\bigr). \qquad (3.12)$
Here $f(\cdot; \omega)$ is the neural network defined in (3.1) with $W = \omega$. In the sequel, we call $\omega$ the value parameter.
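The policy, the reward, and the value estimate are thus thin wrappers around the same two-layer network. The sketch below is our own illustration of this shared parameterization; the finite action set, the dimensions, and the $(1 - \gamma)$ reward scaling in (3.6) reflect our reading of the text rather than the paper's exact choices.

import numpy as np

def f_value(x, W, b):
    """Two-layer ReLU network (3.1)."""
    return (b * np.maximum(W @ x, 0.0)).sum() / np.sqrt(len(b))

def policy_probs(theta, b, tau, s, actions):
    """Energy-based policy (3.5): softmax of tau * f((s, a); theta) over a finite action set."""
    energies = np.array([tau * f_value(np.concatenate([s, a]), theta, b) for a in actions])
    energies -= energies.max()                       # for numerical stability
    p = np.exp(energies)
    return p / p.sum()

def reward(beta, b, s, a, gamma):
    """Reward parameterization (3.6), with the (1 - gamma) scaling assumed from the text."""
    return (1.0 - gamma) * f_value(np.concatenate([s, a]), beta, b)

def q_value(omega, b, s, a):
    """Critic parameterization (3.12): Q_omega(s, a) = f((s, a); omega)."""
    return f_value(np.concatenate([s, a]), omega, b)

rng = np.random.default_rng(0)
d, m = 4, 64
b = rng.choice([-1.0, 1.0], size=m)
W0 = rng.normal(scale=1.0 / np.sqrt(d), size=(m, d))   # shared random initialization (3.3)
s = np.array([0.3, -0.1])
actions = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]
print(policy_probs(W0, b, tau=1.0, s=s, actions=actions))
print(reward(W0, b, s, actions[0], gamma=0.9), q_value(W0, b, s, actions[0]))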
3.2 GAIL with Alternating Updates
We employ an actor-critic scheme with alternating updates of the policy and the reward function, which is presented in Algorithm 1. Specifically, we update the policy parameter via natural policy gradient and update the reward parameter via gradient ascent in the actor step, while we estimate the state-action value function via neural temporal difference (TD) (Cai et al., 2019c) in the critic step.
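Before detailing the two steps, the following skeleton (our own sketch in the spirit of Algorithm 1, with every estimator passed in as a callable and the inverse temperature update omitted) shows the control flow of the alternating scheme.

import numpy as np

def gail_alternating_updates(theta0, beta0, T, eta, critic_step, npg_direction,
                             reward_grad, project_theta, project_beta):
    """Skeleton of the actor-critic scheme with alternating updates (a sketch only).

    critic_step(theta, beta)    -> omega  : neural TD estimate of the value parameter (Algorithm 2)
    npg_direction(theta, omega) -> delta  : estimate of the natural policy gradient direction
    reward_grad(theta, beta)    -> g      : estimate of the gradient of the objective in beta
    """
    theta, beta, policies = theta0.copy(), beta0.copy(), []
    for _ in range(T):
        omega = critic_step(theta, beta)            # critic step: policy evaluation
        delta = npg_direction(theta, omega)
        g = reward_grad(theta, beta)
        theta = project_theta(theta + eta * delta)  # actor step: NPG update of the policy (3.13)
        beta = project_beta(beta + eta * g)         # actor step: gradient ascent on the reward (3.14)
        policies.append(theta.copy())
    return policies  # the output mixed policy picks one of these uniformly at random

# Dummy usage with placeholder callables, just to exercise the control flow.
d = 3
out = gail_alternating_updates(
    np.zeros(d), np.zeros(d), T=5, eta=0.1,
    critic_step=lambda th, be: np.zeros(d),
    npg_direction=lambda th, om: -th + 1.0,
    reward_grad=lambda th, be: -be + 1.0,
    project_theta=lambda x: x, project_beta=lambda x: x)
print(len(out))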
Actor Step. In the $t$-th actor step, we update the policy parameter $\theta_t$ and the reward parameter $\beta_t$ as follows,
(3.13)
(3.14)
where
(3.15)
Here is the stepsize, and are the parameter domains of and defined in (3.4) with domain radii and , respectively, is the projection operator, is the inverse temperature parameter of , and are the estimators of , respectively, which are defined in the sequel. In (3.13), we update the policy parameter along the direction , which approximates the natural policy gradient , and in (3.15) we update the inverse temperature parameter to ensure that . Meanwhile, in (3.14), we update the reward parameter via (projected) gradient ascent. Following from and (3.10) of Proposition 3.1, we construct the estimators of and as follows,
(3.16)
(3.17)
where is sampled from the state-action visitation measure given with the batch size , and is the estimator of computed in the critic step. Meanwhile, following from (3.11) of Proposition 3.1, we construct the estimator of as follows,
(3.18)
where the batch of state-action pairs in (3.18) is the expert trajectory. For notational simplicity, we write $\pi_t = \pi_{\theta_t}$, $r_t = r_{\beta_t}$, and $\nu_t = \nu_{\pi_t}$, $\varrho_t = \varrho_{\pi_t}$ for the $t$-th step hereafter, where $\pi_t$ is the policy, $r_t$ is the reward function, and $\nu_t, \varrho_t$ are the visitation measures defined in (2.3).
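The precise estimators are given in (3.16)-(3.18). As a hedged reconstruction, the sketch below computes a batch estimate of the gradient of the objective in $\beta$ under the parameterization in (3.6): up to constants and the linearization of the network, it reduces to the difference of average feature vectors between the expert batch and the policy batch, minus $\lambda$ times the gradient of the regularizer. The batch arrays, the regularizer gradient, and the constants are placeholders of our own.

import numpy as np

def relu_feature(x, W, b):
    """Feature map (3.2) of the two-layer ReLU network, i.e., the gradient of f(x; W) in W."""
    active = (W @ x > 0.0).astype(float)
    return (b * active)[:, None] * x[None, :] / np.sqrt(len(b))

def reward_gradient_estimate(beta, b, expert_batch, policy_batch, lam, grad_psi):
    """Batch estimate of the gradient of the objective (2.4) in beta: up to constants and the
    ReLU linearization, it is E_expert[phi_beta] - E_policy[phi_beta] - lam * grad_psi(beta)."""
    g_expert = np.mean([relu_feature(x, beta, b) for x in expert_batch], axis=0)
    g_policy = np.mean([relu_feature(x, beta, b) for x in policy_batch], axis=0)
    return g_expert - g_policy - lam * grad_psi(beta)

rng = np.random.default_rng(0)
d, m = 4, 32
b = rng.choice([-1.0, 1.0], size=m)
beta = rng.normal(scale=1.0 / np.sqrt(d), size=(m, d))
expert_batch = rng.normal(loc=0.5, size=(50, d))      # stand-ins for expert state-action pairs
policy_batch = rng.normal(loc=0.0, size=(50, d))      # stand-ins for samples from the learned policy
g_hat = reward_gradient_estimate(beta, b, expert_batch, policy_batch, lam=0.1, grad_psi=lambda B: B)
print(g_hat.shape)  # (m, d), used in the projected gradient ascent step (3.14)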
Critic Step. Note that the estimator in (3.17) involves the estimator $\widehat{Q}_{\omega_t}$ of $Q^{\pi_t}_{r_t}$. To this end, we parameterize $Q_{\omega}$ as in (3.12) and adapt neural TD (Cai et al., 2019c), which solves the following minimization problem,
$\min_{\omega \in \mathcal{S}_{B_{\omega}}} \; \mathbb{E}_{(s, a) \sim \sigma_{\pi_t}}\Bigl[\bigl(Q_{\omega}(s, a) - [\mathcal{T}^{\pi_t} Q_{\omega}](s, a)\bigr)^2\Bigr]. \qquad (3.19)$
Here $\mathcal{S}_{B_{\omega}}$ is the parameter domain with domain radius $B_{\omega}$, $\sigma_{\pi_t}$ is the state-action stationary distribution induced by $\pi_t$, and $\mathcal{T}^{\pi_t}$ is the Bellman operator. Note that the Bellman operator is defined as follows,
$[\mathcal{T}^{\pi_t} Q](s, a) = \mathbb{E}\bigl[r_t(s, a) + \gamma \cdot Q(s', a')\bigr],$
where the expectation is taken with respect to $s' \sim P(\cdot \,|\, s, a)$ and $a' \sim \pi_t(\cdot \,|\, s')$. In neural TD, we iteratively update the value parameter $\omega$ via
$\omega \leftarrow \Pi_{\mathcal{S}_{B_{\omega}}}\Bigl(\omega - \eta_{\rm TD} \cdot \delta(s, a, s', a') \cdot \nabla_{\omega} Q_{\omega}(s, a)\Bigr), \qquad (3.20)$
where $\delta(s, a, s', a') = Q_{\omega}(s, a) - r_t(s, a) - \gamma \cdot Q_{\omega}(s', a')$ is the Bellman residual, $\eta_{\rm TD} > 0$ is the stepsize, $(s, a)$ is sampled from the state-action stationary distribution $\sigma_{\pi_t}$, and $(s', a')$ are the subsequent state and action. We defer the detailed discussion of neural TD to §B.
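As a reference for a single critic update, the following sketch (our own illustration; the transition tuple, stepsize, and radius are placeholders) performs one projected TD(0) step of the form sketched in (3.20) on the neural parameterization in (3.12).

import numpy as np

def f_value(x, W, b):
    """Two-layer ReLU network (3.1) used as the critic Q_omega in (3.12)."""
    return (b * np.maximum(W @ x, 0.0)).sum() / np.sqrt(len(b))

def relu_feature(x, W, b):
    """Gradient of f(x; W) in W, i.e., the feature vector phi_W(x) in (3.2)."""
    active = (W @ x > 0.0).astype(float)
    return (b * active)[:, None] * x[None, :] / np.sqrt(len(b))

def neural_td_step(omega, W0, b, sa, sa_next, r, gamma, lr, B):
    """One TD(0) update in the spirit of (3.20): step against the semi-gradient of the
    Bellman residual, then project back onto the ball {omega : ||omega - W(0)|| <= B}."""
    residual = f_value(sa, omega, b) - r - gamma * f_value(sa_next, omega, b)
    omega = omega - lr * residual * relu_feature(sa, omega, b)
    diff = omega - W0
    norm = np.linalg.norm(diff)
    return omega if norm <= B else W0 + diff * (B / norm)

rng = np.random.default_rng(0)
d, m = 4, 32
b = rng.choice([-1.0, 1.0], size=m)
W0 = rng.normal(scale=1.0 / np.sqrt(d), size=(m, d))
omega = W0.copy()
sa, sa_next = rng.normal(size=d), rng.normal(size=d)    # stand-ins for consecutive state-action pairs
omega = neural_td_step(omega, W0, b, sa, sa_next, r=0.5, gamma=0.9, lr=0.1, B=1.0)
print(np.linalg.norm(omega - W0) <= 1.0)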
To approximately obtain the compatible function approximation (Sutton et al., 2000; Wang et al., 2019), we share the random initialization among the policy $\pi_{\theta}$, the reward function $r_{\beta}$, and the state-action value function $Q_{\omega}$. In other words, we set $\theta_0 = \beta_0 = \omega_0 = W(0)$ in our algorithm, where $W(0)$ is the random initialization in (3.3). The output of GAIL is the mixed policy $\bar{\pi}$ of $\{\pi_t\}_{t=1}^{T}$ (Altman, 1999). Specifically, the mixed policy $\bar{\pi}$ of $\{\pi_t\}_{t=1}^{T}$ is executed by randomly selecting a policy $\pi_t$ for $t \in [T]$ with equal probability before time $0$ and exclusively following $\pi_t$ thereafter. It then holds for any reward function $r$ that
$J(\bar{\pi}; r) = \frac{1}{T} \sum_{t=1}^{T} J(\pi_t; r). \qquad (3.21)$
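The identity (3.21) holds because the mixed policy commits to a single $\pi_t$ for the whole trajectory, so its expected cumulative reward averages those of $\pi_1, \ldots, \pi_T$. The short sketch below (our own illustration, with scalar stand-ins for the per-policy returns) checks the averaging numerically.

import numpy as np

def run_mixed_policy(policies, rollout, rng):
    """Execute the mixed policy: pick one of pi_1, ..., pi_T uniformly at random up front
    and follow it for the whole trajectory, so J(mixed; r) = (1/T) * sum_t J(pi_t; r)."""
    t = rng.integers(len(policies))
    return rollout(policies[t])

# Toy check of the averaging identity with scalar "returns" standing in for J(pi_t; r).
rng = np.random.default_rng(0)
returns = [1.0, 2.0, 4.0]                      # pretend these are J(pi_t; r) for T = 3
samples = [run_mixed_policy(returns, rollout=lambda j: j, rng=rng) for _ in range(20000)]
print(np.mean(samples), np.mean(returns))      # both close to 7/3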
4 Main Results
In this section, we first present the assumptions for our analysis. Then, we establish the global optimality and convergence of Algorithm 1.
4.1 Assumptions
We impose the following assumptions on the stationary distributions and the visitation measures .
Assumption 4.1.
We assume that the following properties hold.
-
(a)
Let be either or . We assume for an absolute constant that
-
(b)
We assume for an absolute constant that
Here , , , and are the Radon-Nikodym derivatives.
Assumption 4.1 (a) holds when the probability density functions of and are uniformly upper bounded across . Assumption 4.1 (b) states that the concentrability coefficients are uniformly upper bounded across , which is commonly used in the analysis of RL (Szepesvári and Munos, 2005; Munos and Szepesvári, 2008; Antos et al., 2008; Farahmand et al., 2010; Scherrer et al., 2015; Farahmand et al., 2016; Lazaric et al., 2016).
For notational simplicity, we write and , where is the neural network defined in (3.1) with , is the feature vector defined in (3.2) with , and is the random initialization in (3.3). We impose the following assumptions on the neural network and the transition kernel .
Assumption 4.2.
We assume that the following properties hold.
-
(a)
Let . We assume for absolute constants and that
(4.1) -
(b)
We assume that the transition kernel belongs to the following class,
Here is an absolute constant, is the probability density function of , and is defined as
Assumption 4.2 (b) states that the MDP belongs to (a variant of) the class of linear MDPs (Yang and Wang, 2019a, b; Jin et al., 2019; Cai et al., 2019b). However, our class of transition kernels is infinite-dimensional, and thus, captures a rich class of MDPs. To understand Assumption 4.2 (a), recall that we initialize the neural network with $b_i \sim {\rm Unif}(\{-1, 1\})$ and $[W(0)]_i \sim N(0, I_d / d)$ for any $i \in [m]$. Thus, the neural network defined in (3.1) with $W = W(0)$ converges to a Gaussian process indexed by $(s, a) \in \mathcal{S} \times \mathcal{A}$ as $m$ goes to infinity. Following from the facts that the maximum of a Gaussian process over a compact index set is sub-Gaussian (van de Geer and Muro, 2014) and that $\mathcal{S} \times \mathcal{A}$ is compact, it is reasonable to assume that $\max_{(s, a)} |f((s, a); W(0))|$ is sub-Gaussian, which further implies the existence of the absolute constants in (4.1) of Assumption 4.2 (a). Moreover, if we assume that $m$ is even and initialize the parameters as follows,
$b_i \sim {\rm Unif}\bigl(\{-1, 1\}\bigr), \quad [W(0)]_i \sim N(0, I_d / d), \quad b_{i + m/2} = -b_i, \quad [W(0)]_{i + m/2} = [W(0)]_i, \quad \forall i \in [m/2], \qquad (4.2)$
we have that $f((s, a); W(0)) = 0$ for any $(s, a) \in \mathcal{S} \times \mathcal{A}$, which allows us to set the corresponding constants in Assumption 4.2 (a) accordingly. Also, it holds that , which implies that for any and . The proof of our results with the random initialization in (4.2) is identical.
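To illustrate the symmetric initialization in (4.2), the following sketch (our own illustration; the dimensions are placeholders) duplicates the first $m/2$ neurons with negated output weights and verifies that the network vanishes at initialization.

import numpy as np

def symmetric_init(d, m, seed=0):
    """Symmetric initialization in the spirit of (4.2): duplicate the first m/2 neurons with
    negated output weights so that f(x; W(0)) = 0 for every input x (m is assumed even)."""
    assert m % 2 == 0
    rng = np.random.default_rng(seed)
    half = m // 2
    b_half = rng.choice([-1.0, 1.0], size=half)
    W_half = rng.normal(scale=1.0 / np.sqrt(d), size=(half, d))
    b = np.concatenate([b_half, -b_half])
    W0 = np.vstack([W_half, W_half])
    return b, W0

def f_value(x, W, b):
    return (b * np.maximum(W @ x, 0.0)).sum() / np.sqrt(len(b))

b, W0 = symmetric_init(d=4, m=64)
x = np.random.default_rng(1).normal(size=4)
print(np.isclose(f_value(x, W0, b), 0.0))   # the network vanishes at initialization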
Finally, we impose the following assumption on the regularizer $\psi$ and the variances of the estimators defined in (3.16), (3.17), and (3.18).
Assumption 4.3.
We assume that the following properties hold.
- (a)
-
(b)
We assume that the regularizer in (2.4) is convex and -Lipschitz continuous over the compact parameter domain .
Assumption 4.3 (a) holds when , , and have uniformly upper bounded variances across and , and the Markov chain that generates mixes sufficiently fast (Zhang et al., 2019a). Similar assumptions are also used in the analysis of policy optimization (Xu et al., 2019a, b). Also, Assumption 4.3 (b) holds for most commonly used regularizers (Ho and Ermon, 2016).
4.2 Global Optimality and Convergence
In this section, we establish the global optimality and convergence of Algorithm 1. The following proposition adapted from Cai et al. (2019c) characterizes the global optimality and convergence of neural TD, which is presented in Algorithm 2.
Proposition 4.4 (Global Optimality and Convergence of Neural TD).
Proof.
See §B.1 for a detailed proof. ∎
The term in (4.6) of Proposition 4.4 characterizes the hardness of estimating the state-action value function by the neural network defined in (3.1), which arises because $f((s, a); W(0))$ is not uniformly upper bounded across the random initialization. Note that if we employ the random initialization in (4.2), we have that $f((s, a); W(0)) = 0$ for any $(s, a)$, and consequently, such a term vanishes. We are now ready to establish the global optimality and convergence of Algorithm 1.
Theorem 4.5 (Global Optimality and Convergence of GAIL).
We set and for an absolute constant , and in Algorithm 1. Let be the output of Algorithm 1. Under Assumptions 4.1-4.3, it holds that
(4.7) |
Here , is the -distance defined in (2.5) with , the expectation is taken with respect to the random initialization in (3.3) and the batches, and the error term satisfies that
(4.8) |
where is defined in Assumption 4.1, and are defined in Assumption (b), and is the error induced by neural TD (Algorithm 2).
Proof.
See §5 for a detailed proof. ∎
The optimality gap in (4.7) of Theorem 4.5 is measured by the expected -distance between the expert policy and the learned policy . Thus, by showing that the optimality gap is upper bounded by , we prove that (approximately) outperforms the expert policy in expectation when the number of iterations goes to infinity. As shown in (4.7) of Theorem 4.5, the optimality gap is upper bounded by the sum of the three terms. Term (i) corresponds to the rate of convergence of Algorithm 1. Term (ii) corresponds to the bias induced by the regularizer in the objective function defined in (2.4). Term (iii) is upper bounded by the sum of the three terms in (4.8) of Theorem 4.5. In detail, term (iii.a) corresponds to the error induced by the variances of , , and defined in (4.3), (4.4), and (4.5) of Assumption 4.3, which vanishes as the batch size in Algorithm 1 goes to infinity. Term (iii.b) is the error of estimating by using neural TD (Algorithm 2). As shown in Proposition 4.4, the estimation error vanishes as and go to infinity. Term (iii.c) corresponds to the linearization error of the neural network defined in (3.1), which is characterized in Lemma A.2. Following from Theorem 4.5, it holds for , , and that , which implies the rate of convergence of Algorithm 1 (up to the bias induced by the regularizer).
5 Proof of Main Results
In this section, we present the proof of Theorem 4.5, which establishes the global optimality and convergence of Algorithm 1. For notational simplicity, we write $\pi(s, a) = \pi(a \,|\, s)$ for any policy $\pi$, state $s$, and action $a$. For any policies $\pi_1, \pi_2$ and distribution $\varrho$ over $\mathcal{S}$, we denote the expected Kullback-Leibler (KL) divergence by ${\rm KL}_{\varrho}(\pi_1 \,\|\, \pi_2)$, which is defined as ${\rm KL}_{\varrho}(\pi_1 \,\|\, \pi_2) = \mathbb{E}_{s \sim \varrho}\bigl[\sum_{a \in \mathcal{A}} \pi_1(a \,|\, s) \log \bigl(\pi_1(a \,|\, s) / \pi_2(a \,|\, s)\bigr)\bigr]$. For any visitation measures $\nu$ and $\varrho$, we denote by $\mathbb{E}_{\nu}$ and $\mathbb{E}_{\varrho}$ the expectations taken with respect to $\nu$ and $\varrho$, respectively.
Following from the property of the mixed policy in (3.21), we have that
(5.1) |
We now upper bound the optimality gap in (5.1) by upper bounding the following difference of expected cumulative rewards,
(5.2) |
where is chosen arbitrarily and is the objective function defined in (2.4). Following from Assumption 4.3 and the fact that , we have that
(5.3) |
which upper bounds term (iii) of (5.2). It remains to upper bound terms (i) and (ii) of (5.2), which hinges on the one-point convexity of $L(\theta, \beta)$ with respect to $\theta$ and the (approximate) convexity of $L(\theta, \beta)$ with respect to $\beta$.

Upper bound of term (i) in (5.2). In what follows, we upper bound term (i) of (5.2). We first introduce the following cost difference lemma (Kakade and Langford, 2002), which corresponds to the one-point convexity of $L(\theta, \beta)$ with respect to $\theta$. Recall that $\varrho_{\rm E}$ is the state visitation measure induced by the expert policy $\pi_{\rm E}$.
Lemma 5.1 (Cost Difference Lemma, Lemma 6.1 in Kakade and Langford (2002)).
For any policy $\pi$ and reward function $r$, it holds that
$J(\pi_{\rm E}; r) - J(\pi; r) = (1 - \gamma)^{-1} \cdot \mathbb{E}_{(s, a) \sim \nu_{\rm E}}\bigl[A^{\pi}_{r}(s, a)\bigr], \qquad (5.4)$
where $\gamma$ is the discount factor.
Furthermore, we establish the following lemma, which upper bounds the right-hand side of (5.4) in Lemma 5.1.
Lemma 5.2.
Proof.
See §C.2 for a detailed proof. ∎
Combining Lemmas 5.1 and 5.2, we have that
(5.6) |
which upper bounds term (i) of (5.2). Here is upper bounded in (5.2) of Lemma 5.2.
Upper bound of term (ii) in (5.2). In what follows, we upper bound term (ii) of (5.2). We first establish the following lemma, which characterizes the (approximate) convexity of with respect to .
Lemma 5.3.
Under Assumption (a), it holds for any that
Proof.
See §C.3 for a detailed proof. ∎
The term in Lemma 5.3 arises from the linearization error of the neural network, which is characterized in Lemma A.2. Based on Lemma 5.3, we establish the following lemma, which upper bounds term (ii) of (5.2).
Proof.
See §C.4 for a detailed proof. ∎
By Lemma 5.4, we have that
(5.8) |
which upper bounds term (ii) of (5.2). Here is upper bounded in (5.7) of Lemma 5.4.
Plugging (5.3), (5.6), and (5.8) into (5.2), we obtain that
(5.9) | ||||
Here , where and are upper bounded in (5.2) and (5.7) of Lemmas 5.2 and 5.4, respectively. Note that the upper bound of does not depend on and . Upon telescoping (5.9) with respect to , we obtain that
(5.10) | ||||
Following from the fact that and the parameterization of in (3.5), it holds that is the uniform distribution over for any . Thus, we have . Meanwhile, following from the fact that , it holds that . Finally, by setting , , and in (5.10), we have that
Here is upper bounded as follows,
where for an absolute constant . Thus, we complete the proof of Theorem 4.5.
References
- Agarwal et al. (2019) Agarwal, A., Kakade, S. M., Lee, J. D. and Mahajan, G. (2019). Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261.
- Altman (1999) Altman, E. (1999). Constrained Markov decision processes, vol. 7. CRC Press.
- Anthony and Bartlett (2009) Anthony, M. and Bartlett, P. L. (2009). Neural network learning: Theoretical foundations. Cambridge University Press.
- Antos et al. (2008) Antos, A., Szepesvári, C. and Munos, R. (2008). Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71 89–129.
- Arjovsky et al. (2017) Arjovsky, M., Chintala, S. and Bottou, L. (2017). Wasserstein GAN. arXiv preprint arXiv:1701.07875.
- Bhandari and Russo (2019) Bhandari, J. and Russo, D. (2019). Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786.
- Cai et al. (2019a) Cai, Q., Hong, M., Chen, Y. and Wang, Z. (2019a). On the global convergence of imitation learning: A case for linear quadratic regulator. arXiv preprint arXiv:1901.03674.
- Cai et al. (2019b) Cai, Q., Yang, Z., Jin, C. and Wang, Z. (2019b). Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830.
- Cai et al. (2019c) Cai, Q., Yang, Z., Lee, J. D. and Wang, Z. (2019c). Neural temporal-difference learning converges to global optima. arXiv preprint arXiv:1905.10027.
- Chen et al. (2020) Chen, M., Wang, Y., Liu, T., Yang, Z., Li, X., Wang, Z. and Zhao, T. (2020). On computation and generalization of generative adversarial imitation learning. arXiv preprint arXiv:2001.02792.
- Farahmand et al. (2016) Farahmand, A.-m., Ghavamzadeh, M., Szepesvári, C. and Mannor, S. (2016). Regularized policy iteration with nonparametric function spaces. The Journal of Machine Learning Research, 17 4809–4874.
- Farahmand et al. (2010) Farahmand, A.-m., Szepesvári, C. and Munos, R. (2010). Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems.
- Finn et al. (2016) Finn, C., Levine, S. and Abbeel, P. (2016). Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning.
- Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems.
- Ho and Ermon (2016) Ho, J. and Ermon, S. (2016). Generative adversarial imitation learning. In Advances in Neural Information Processing Systems.
- Hofmann et al. (2008) Hofmann, T., Schölkopf, B. and Smola, A. J. (2008). Kernel methods in machine learning. The Annals of Statistics 1171–1220.
- Jin et al. (2019) Jin, C., Yang, Z., Wang, Z. and Jordan, M. I. (2019). Provably efficient reinforcement learning with linear function approximation. arXiv preprint arXiv:1907.05388.
- Kakade and Langford (2002) Kakade, S. and Langford, J. (2002). Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, vol. 2.
- Kakade (2002) Kakade, S. M. (2002). A natural policy gradient. In Advances in Neural Information Processing Systems.
- Kuefler et al. (2017) Kuefler, A., Morton, J., Wheeler, T. and Kochenderfer, M. (2017). Imitating driver behavior with generative adversarial networks. In IEEE Intelligent Vehicles Symposium. IEEE.
- Lazaric et al. (2016) Lazaric, A., Ghavamzadeh, M. and Munos, R. (2016). Analysis of classification-based policy iteration algorithms. The Journal of Machine Learning Research, 17 583–612.
- Levine and Koltun (2012) Levine, S. and Koltun, V. (2012). Continuous inverse optimal control with locally optimal examples. arXiv preprint arXiv:1206.4617.
- Liu et al. (2019) Liu, B., Cai, Q., Yang, Z. and Wang, Z. (2019). Neural proximal/trust region policy optimization attains globally optimal policy. arXiv preprint arXiv:1906.10306.
- Merel et al. (2017) Merel, J., Tassa, Y., TB, D., Srinivasan, S., Lemmon, J., Wang, Z., Wayne, G. and Heess, N. (2017). Learning human behaviors from motion capture by adversarial imitation. arXiv preprint arXiv:1707.02201.
- Munos and Szepesvári (2008) Munos, R. and Szepesvári, C. (2008). Finite-time bounds for fitted value iteration. The Journal of Machine Learning Research, 9 815–857.
- Nesterov (2013) Nesterov, Y. (2013). Introductory lectures on convex optimization: A basic course, vol. 87. Springer Science & Business Media.
- Ng and Russell (2000) Ng, A. Y. and Russell, S. J. (2000). Algorithms for inverse reinforcement learning. In International Conference on Machine Learning.
- Peters and Schaal (2008) Peters, J. and Schaal, S. (2008). Natural actor-critic. Neurocomputing, 71 1180–1190.
- Pomerleau (1991) Pomerleau, D. A. (1991). Efficient training of artificial neural networks for autonomous navigation. Neural Computation, 3 88–97.
- Rafique et al. (2018) Rafique, H., Liu, M., Lin, Q. and Yang, T. (2018). Non-convex min-max optimization: Provable algorithms and applications in machine learning. arXiv:1810.02060.
- Rahimi and Recht (2008) Rahimi, A. and Recht, B. (2008). Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems.
- Rahimi and Recht (2009) Rahimi, A. and Recht, B. (2009). Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. Advances in Neural Information Processing Systems 1313–1320.
- Ross and Bagnell (2010) Ross, S. and Bagnell, D. (2010). Efficient reductions for imitation learning. In International Conference on Artificial Intelligence and Statistics.
- Ross et al. (2011) Ross, S., Gordon, G. and Bagnell, D. (2011). A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics.
- Russell (1998) Russell, S. (1998). Learning agents for uncertain environments. In Conference on Learning Theory.
- Scherrer et al. (2015) Scherrer, B., Ghavamzadeh, M., Gabillon, V., Lesner, B. and Geist, M. (2015). Approximate modified policy iteration and its application to the game of Tetris. The Journal of Machine Learning Research, 16 1629–1676.
- Sutton and Barto (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.
- Sutton et al. (2000) Sutton, R. S., McAllester, D. A., Singh, S. P. and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems.
- Syed et al. (2008) Syed, U., Bowling, M. and Schapire, R. E. (2008). Apprenticeship learning using linear programming. In International Conference on Machine Learning.
- Szepesvári and Munos (2005) Szepesvári, C. and Munos, R. (2005). Finite time bounds for sampling based fitted value iteration. In International Conference on Machine Learning. ACM.
- Tai et al. (2018) Tai, L., Zhang, J., Liu, M. and Burgard, W. (2018). Socially compliant navigation through raw depth inputs with generative adversarial imitation learning. In IEEE International Conference on Robotics and Automation. IEEE.
- van de Geer and Muro (2014) van de Geer, S. and Muro, A. (2014). On higher order isotropy conditions and lower bounds for sparse quadratic forms. Electronic Journal of Statistics, 8 3031–3061.
- Wang et al. (2019) Wang, L., Cai, Q., Yang, Z. and Wang, Z. (2019). Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150.
- Xu et al. (2019a) Xu, P., Gao, F. and Gu, Q. (2019a). An improved convergence analysis of stochastic variance-reduced policy gradient. arXiv preprint arXiv:1905.12615.
- Xu et al. (2019b) Xu, P., Gao, F. and Gu, Q. (2019b). Sample efficient policy gradient methods with recursive variance reduction. arXiv preprint arXiv:1909.08610.
- Yang and Wang (2019a) Yang, L. and Wang, M. (2019a). Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning.
- Yang and Wang (2019b) Yang, L. F. and Wang, M. (2019b). Reinforcement leaning in feature space: Matrix bandit, kernels, and regret bound. arXiv preprint arXiv:1905.10389.
- Yu et al. (2016) Yu, L., Zhang, W., Wang, J. and Yu, Y. (2016). SeqGAN: Sequence generative adversarial nets with policy gradient. arXiv preprint arXiv:1609.05473.
- Zhang et al. (2019a) Zhang, K., Koppel, A., Zhu, H. and Başar, T. (2019a). Global convergence of policy gradient methods to (almost) locally optimal policies. arXiv preprint arXiv:1906.08383.
- Zhang et al. (2019b) Zhang, K., Yang, Z. and Başar, T. (2019b). Policy optimization provably converges to Nash equilibria in zero-sum linear quadratic games. arXiv preprint arXiv:1906.00729.
Appendix A Neural Networks
In what follows, we present the properties of the neural network defined in (3.1). First, we define the following function class.
Definition A.1 (Function Class).
As shown in Rahimi and Recht (2008), the feature induces a reproducing kernel Hilbert space (RKHS), namely . When goes to infinity, approximates a ball in , which captures a rich class of functions (Hofmann et al., 2008; Rahimi and Recht, 2008). Furthermore, we obtain the following lemma from Cai et al. (2019c), which characterizes the linearization error of the neural network defined in (3.1).
Lemma A.2 (Linearization Error, Lemma 5.1 in Cai et al. (2019c)).
Proof.
See §A.1 for a detailed proof. ∎
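To give a numerical sense of Lemma A.2, the following sketch (our own illustration; the dimensions, radius, and sample sizes are placeholders) measures the gap between the two-layer network and its linearization at the random initialization, $f(x; W(0)) + \langle \phi_{W(0)}(x), W - W(0) \rangle$, for parameters in a ball around $W(0)$; the gap shrinks as the width $m$ grows.

import numpy as np

def f_value(x, W, b):
    return (b * np.maximum(W @ x, 0.0)).sum() / np.sqrt(len(b))

def feature(x, W, b):
    active = (W @ x > 0.0).astype(float)
    return (b * active)[:, None] * x[None, :] / np.sqrt(len(b))

def linearization_error(d, m, radius, n_test=200, seed=0):
    """Average gap between f(x; W) and its linearization at W(0),
    f(x; W(0)) + <phi_{W(0)}(x), W - W(0)>, for W on the sphere ||W - W(0)|| = radius."""
    rng = np.random.default_rng(seed)
    b = rng.choice([-1.0, 1.0], size=m)
    W0 = rng.normal(scale=1.0 / np.sqrt(d), size=(m, d))
    err = 0.0
    for _ in range(n_test):
        x = rng.normal(size=d)
        x /= np.linalg.norm(x)                       # unit-norm input, as assumed in Section 2.1
        delta = rng.normal(size=(m, d))
        delta *= radius / np.linalg.norm(delta)
        W = W0 + delta
        lin = f_value(x, W0, b) + (feature(x, W0, b) * delta).sum()
        err += abs(f_value(x, W, b) - lin)
    return err / n_test

for m in (64, 256, 1024, 4096):
    print(m, linearization_error(d=10, m=m, radius=1.0))   # the gap decreases as m grows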
Following from Lemma A.2, the function class defined in Definition A.1 is a first-order approximation of the class of the neural networks defined in (3.1). Meanwhile, we establish the following lemma to characterize the sub-Gaussian property of the neural network defined in (3.1).
Lemma A.3.
Proof.
See §A.2 for a detailed proof. ∎
A.1 Proof of Lemma A.2
Proof.
We consider any . By the definition of in (3.2) and the triangle inequality, we have that
(A.1) |
We now upper bound the right-hand side of (A.1). For the term in (A.1), we have that
(A.2) |
where the first inequality follows from the triangle inequality and the second inequality follows from the Cauchy-Schwartz inequality and the fact that . To upper bound the term on the right-hand side of (A.1), note that implies that
Thus, we have that
(A.3) |
Plugging (A.1) and (A.3) into (A.1), we have that
By the fact that , we obtain that
By setting in Assumption (a), we have that
Taking the expectation with respect to the random initialization in (3.3) and using the Cauchy-Schwartz inequality, we have that
where the second inequality follows from the fact that , the third inequality follows from Jensen’s inequality, and the last inequality follows from Assumption (a) and Lemma A.2. Thus, for any , we have that
Moreover, following from the Cauchy-Schwartz inequality, we have that . Thus, by Jensen’s inequality, we have that
which completes the proof of Lemma A.2. ∎
A.2 Proof of Lemma A.3
In what follows, we present the proof of Lemma A.3.
Proof.
Recall that we write and . Then, we have
(A.4) |
where the last inequality follows from the Cauchy-Schwartz inequality. It suffices to upper bound the three terms on the right-hand side of (A.2). Note that we have and . We have that
(A.5) |
It remains to upper bound the term in (A.2). Note that is almost everywhere differentiable with respect to . Also, it holds that . Thus, following from the mean-value theorem and the Cauchy-Schwartz inequality, we have that
(A.6) |
where the second inequality follows from the fact that and . Plugging (A.5) and (A.6) into (A.2), we have that
Following from Assumption 4.2, we have that is sub-Gaussian. Furthermore, it holds that
and that
for . Thus, we complete the proof of Lemma A.3. ∎
Appendix B Neural Temporal Difference
In this section, we introduce neural TD (Cai et al., 2019c), which computes the estimator $\widehat{Q}_{\omega_t}$ in Algorithm 1. Specifically, neural TD solves the optimization problem in (3.19) via the update in (3.20), which is presented in Algorithm 2.
B.1 Proof of Proposition 4.4
Proof.
We obtain the following proposition from Cai et al. (2019c), which characterizes the convergence of Algorithm 2.
Proposition B.1 (Proposition 4.7 in Cai et al. (2019c)).
Recall that we denote by the feature vector corresponding to the random initialization in (3.3). We establish the following lemma to upper bound the bias in (B.1) of Proposition B.1 when the reward function belongs to the reward function class .
Lemma B.2.
Proof.
See §B.2 for a detailed proof. ∎
B.2 Proof of Lemma B.2
Proof.
For notational simplicity, we write . Under Assumption 4.2, we have that
(B.2) |
Thus, since , we have that
where the second equality follows from (B.2) and the last equality follows from Fubini’s theorem. In the sequel, we define
(B.3) |
Note that . Then, we have that
To prove Lemma B.2, we first approximate by
(B.4) |
where for an absolute constant specified later. Then, it holds for any that
where the second inequality follows from the Cauchy-Schwartz inequality and the last inequality follows from the fact that . Thus, we have that
(B.5) |
We now upper bound the right-hand side of (B.5). To this end, we show that is sub-Gaussian in the sequel. By the definition of in (B.3), we have that
(B.6) |
where the second inequality follows from Assumption 4.2 and the third inequality follows from the fact that . Here we denote by the state-action visitation measure starting from the state and following the policy . Following from Lemma A.3, we have that is sub-Gaussian. By Lemma A.3 and (B.2), it holds for that
(B.7) |
Let the absolute constant satisfy that in (B.2). For notational simplicity, we write . By the fact that , we have that
Following from (B.5) and (B.2), we have that
(B.8) |
We now construct , which approximates defined in (B.4). We define
Then, we have that . Note that belongs to the following function class,
We now show that is well approximated by the following function class,
where is the feature vector corresponding to the random initialization. We obtain the following lemma from Rahimi and Recht (2009), which characterizes the approximation error of by .
Lemma B.3 (Lemma 1 in Rahimi and Recht (2009)).
For any , it holds with probability at least that
where .
Appendix C Proofs of Auxiliary Results
C.1 Proof of Proposition 3.1
Proof.
By the definition of the neural network in (3.1), we have for any that almost everywhere. We first calculate . Following from the policy gradient theorem (Sutton and Barto, 2018) and the definition of in (2.4), we have that
(C.1) |
Following from the parameterization of in (3.5) and the definition of in (3.8) of Proposition 3.1, we have that
(C.2) |
Plugging (C.1) into (C.1), we have that
It remains to calculate and . By (C.1) and the definition of in (3.7), it holds that
By the definition of the objective function in (2.4), it holds that
Thus, we complete the proof of Proposition 3.1. ∎
C.2 Proof of Lemma 5.2
Proof.
The proof of Lemma 5.2 is similar to that of Lemmas 5.4 and 5.5 in Wang et al. (2019). By direct calculation, we have that
where takes the form of
(C.3) |
The following lemmas upper bound by upper bounding terms (i.a), (i.b), and (i.c) on the right-hand side of (C.2), respectively. Note that the expectation is taken with respect to the random initialization in (3.3) and .
Lemma C.1 (Upper Bound of Term (i.a) in (C.2)).
Proof.
See §D.1 for a detailed proof. ∎
Lemma C.2 (Upper Bound of Term (i.b) in (C.2)).
Proof.
See §D.2 for a detailed proof. ∎
Lemma C.3 (Upper Bound of Term (i.c) in (C.2)).
Proof.
See §D.3 for a detailed proof. ∎
Finally, by Lemmas C.1-C.3, under Assumptions 4.2 and 4.3, we obtain from (C.2) that
Here is defined in Assumption (a), is the inverse temperature parameter of defined in (3.5), is defined in Assumption 4.3, and is defined in (C.4) of Lemma C.2. Following from Proposition 4.4, we have that
Thus, we complete the proof of Lemma 5.2. ∎
C.3 Proof of Lemma 5.3
Proof.
We consider a fixed . For notational simplicity, we write , and . By the parameterization of in (3.6), we have that
(C.5) |
where the last inequality follows from (3.10) of Proposition 3.1. Then, we have that
where the last inequality follows from Assumption (a), Lemma A.2, and the fact that . Thus, we complete the proof of Lemma 5.3. ∎
C.4 Proof of Lemma 5.4
Proof.
By the update of in (3.14), it holds for any that
which further implies that
(C.6) | ||||
Combining (C.3) and (C.6), we have that
where takes the form of
(C.7) |
We now upper bound terms (ii.a), (ii.b), and (ii.c) on the right-hand side of (C.4). Following from Assumption (a) and Lemma A.2, we have that
(C.8) |
which upper bounds term (ii.c) of (C.4). For term (ii.b) of (C.4), we have that
(C.9) |
where we write . Here the first inequality follows from the Cauchy-Schwartz inequality, the second inequality follows from the fact that , and the last inequality follows from Assumption 4.3. To upper bound term (ii.a) in (C.4), we have that
(C.10) | ||||
where the first inequality follows from the Cauchy-Schwartz inequality and the second inequality follows from the update of in (3.14). Furthermore, we have
(C.11) |
where the first inequality follows from Jensen’s inequality and the second inequality follows from the fact that and the Lipschitz continuity of in Assumption (b). By plugging (C.4) into (C.10), we have that
(C.12) |
where the last inequality follows from Assumption 4.3. Finally, by plugging (C.8), (C.4), and (C.4) into (C.4), we have that
Thus, we complete the proof of Lemma 5.4. ∎
Appendix D Proofs of Supporting Lemmas
In what follows, we present the proofs of the lemmas in §C.
D.1 Proof of Lemma C.1
Proof.
It holds for any policies that
(D.1) |
where only depends on the state . Thus, we have that
where the first inequality follows from the parameterization of and in (3.5) and (3.12), respectively, and the second equality follows from the definition of the temperature-adjusted score function in (3.8) of Proposition 3.1. Here, with a slight abuse of the notation, we define
(D.2) |
Then, following from (D.1) and the update in (3.13), we have that
(D.3) | ||||
In what follows, we upper bound terms (i) and (ii) on the right-hand side of (D.3).
Upper bound of term (i) in (D.3). Following from (3.8) of Proposition 3.1 and (D.1) we have that
(D.4) |
where the inequality follows from the triangle inequality. Following from Assumption (a) and Lemma A.2, we have that
(D.5) |
Furthermore, following from Assumption (b), Lemma A.2, and the Cauchy-Schwartz inequality, we have that
(D.6) |
Here the expectation is taken with respect to the random initialization in (3.3) and . Thus, plugging (D.5) and (D.6) into (D.1), we obtain for term (i) of (D.3) that
(D.7) |
Upper bound of term (ii) in (D.3). Following from the Cauchy-Schwartz inequality, we have that
(D.8) |
Similarly, we have that
(D.9) |
where the last inequality follows from the Cauchy-Schwartz inequality. Combining (D.1) and (D.1), we obtain for term (ii) of (D.3) that
(D.10) |
where the last inequality follows from Assumption 4.1. To upper bound term (ii) of (D.3), it suffices to upper bound the right-hand side of (D.1). For notational simplicity, we write , , and . By the triangle inequality, we have that
(D.11) |
We now upper bound the two terms (ii.a) and (ii.b) on the right-hand side of (D.1). For term (ii.a) of (D.1), following from (3.9) of Proposition 3.1, we have that
(D.12) |
Recall that the expectation is taken with respect to the -th batch. Following from the definition of in (3.17), we have that
(D.13) |
where the first equality follows from the fact that , while the second and third equalities follow from the definition of in (D.2). Following from (D.12) and (D.1), we have that
(D.14) |
Here the last inequality follows from the Cauchy-Schwartz inequality and the fact that as . For notational simplicity, we define the following error terms,
(D.15) | ||||
(D.16) |
Then, we have for term (ii.a) in (D.1) that
(D.17) | ||||
where the first inequality follows from (D.1), the second inequality follows from the triangle inequality, and the last inequality follows from Jensen’s inequality. Similarly to (D.15), we define the following error term,
(D.18) |
We now upper bound the right-hand side of (D.17). Recall the definition of in (3.15). We have that
(D.19) | ||||
Following from (D.12), (D.1), and Jensen’s inequality, we have that
where the last inequality follows from the fact that for any . Following from Assumption (a) and Lemma A.2, we have that
(D.20) |
Plugging (D.19) and (D.1) into (D.17), we have that
(D.21) |
where the last inequality follows from Assumption 4.3. We now upper bound term (ii.a) of (D.1). We have that
(D.22) |
where the expectation is taken with respect to the random initialization in (3.3) and , the first inequality follows from Jensen’s inequality, and the second inequality follows from the Cauchy-Schwartz inequality. Following from Assumption (a) and Lemma A.2, we have that
(D.23) |
To upper bound the right-hand side of (D.1), it remains to upper bound the term . We have that
(D.24) |
where the inequality follows from the Cauchy-Schwartz inequality and the equality follows from the facts that , , and . Plugging (D.23) and (D.24) into (D.1), we have that
(D.25) |
which upper bounds term (ii.b) of (D.1). Plugging (D.1) and (D.25) into (D.1), following from (D.1), we have that
(D.26) |
which upper bounds term (ii) of (D.3).
D.2 Proof of Lemma C.2
D.3 Proof of Lemma C.3
Proof.
Following from (D.1) and the parameterization of in (3.5), we have that
(D.27) | ||||
We now upper bound the two terms on the right-hand side of (D.27). For the first term on the right-hand side of (D.27), recall that we define in (3.15). Thus, we have that
(D.28) |
Following from (D.28) and Hölder’s inequality, we have for any that
Then, following from Pinsker’s inequality, we have that
(D.29) |
By the update of in (3.13) and the definition of in (3.15), we have that . Thus, by Lemma A.3, we have that
(D.30) |
Plugging (D.30) into (D.3), we have that
(D.31) |
For the second term on the right-hand side of (D.27), following from Assumption (a) and Lemma A.2, we have
(D.32) |
Finally, plugging (D.31) and (D.3) into (D.27), we have that
which completes the proof of Lemma C.3. ∎