
On Some Geometric Behavior of Value Iteration on the Orthant: Switching System Perspective

Donghwan Lee. D. Lee is with the Department of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, 34141, South Korea [email protected]. This material was supported in part by the National Research Foundation under Grant NRF-2021R1F1A1061613 and in part by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00469).
Abstract

In this paper, the primary goal is to offer additional insights into value iteration through the lens of switching system models from the control community. These models establish a connection between value iteration and switching system theory and reveal additional geometric behaviors of value iteration in solving discounted Markov decision problems. Specifically, the main contributions of this paper are twofold: 1) We provide a switching system model of value iteration and, based on it, offer a different proof for the contraction property of value iteration. 2) Furthermore, from the additional insights, new geometric behaviors of value iteration are proven when the initial iterate lies in a special region. We anticipate that the proposed perspectives may serve as a useful tool, applicable in various settings. Therefore, further development of these methods could be a valuable avenue for future research.

I Introduction

Dynamic programming [1, 2] is a general and effective methodology for finding an optimal solution for Markov decision problems [3]. Value iteration is a popular class of dynamic programming algorithms, and its exponential convergence is a fundamental and well-established result [1, 2], which follows from the contraction mapping property of the Bellman operator.

In this paper, we offer additional insights into value iteration based on switching system models [4, 5, 6, 7, 8, 9, 10, 11] from the control community. These models reveal new geometric behaviors of value iteration in solving Markov decision problems [3]. In particular, the main contributions of this paper are twofold: 1) We present a switching system model of value iteration and, based on it, provide a different proof for the contraction property of value iteration. 2) Furthermore, from the new perspectives and frameworks, we study additional geometric behaviors of value iteration when the initial iterate lies in a special region. More specifically, the detailed contributions are summarized as follows:

  1.

    A switching system model of value iteration is introduced along with its special properties. Based on it, we offer a different proof for the following contraction property of the value iteration:

    \left\|Q_{k+1}-Q^{*}\right\|_{\infty}\leq\gamma\left\|Q_{k}-Q^{*}\right\|_{\infty}, \qquad (1)

    where Q_{k} is the k-th iterate of the value iteration, Q^{*} is the optimal Q-function, and \gamma\in(0,1) is the so-called discount factor of the underlying Markov decision problem. Although this result is fundamental and classical, the proof technique utilized in this paper is distinct from the classical approaches, offering additional insights.

  2.

    Furthermore, from the additional insights provided in the first part, we prove new geometric behaviors of value iteration when the initial iterate lies in a special region. Specifically, when the initial iterate Q_{0} lies in the set \{Q:Q\leq Q^{*}\} (a shifted orthant), the iterates of the value iteration remain in this set, displaying behaviors different from those with a general initial iterate. When Q_{0}\leq Q^{*}, one can establish the new contraction property

    \left\|Q_{k+1}-Q^{*}\right\|_{M}\leq(\gamma+\varepsilon)\left\|Q_{k}-Q^{*}\right\|_{M}, \qquad (2)

    where \left\|(\cdot)\right\|_{M}=\sqrt{(\cdot)^{T}M(\cdot)} is the weighted Euclidean norm, \varepsilon>0 is any real number such that \gamma+\varepsilon\in(0,1), and M is a positive definite matrix depending on \varepsilon. This result corresponds to a contraction mapping property of the Bellman operator in terms of the weighted Euclidean norm, which cannot be derived from the classical approaches. Geometrically, the sublevel sets corresponding to the infinity norm \left\|\cdot\right\|_{\infty} are squares, while the sublevel sets corresponding to the weighted Euclidean norm \left\|\cdot\right\|_{M} are ellipsoids.

    A more notable result is that when Q_{0}\leq Q^{*}, the iterates satisfy the linear inequality

    (\gamma+\varepsilon)v^{T}(Q_{k}-Q^{*})\leq v^{T}(Q_{k+1}-Q^{*})\leq 0,\quad\forall k\geq 0, \qquad (3)

    where \varepsilon>0 is any real number such that \gamma+\varepsilon\in(0,1), and v is a positive vector depending on \varepsilon. Geometrically, this implies that the component of Q_{k}-Q^{*} along the direction of v diminishes exponentially.

The analysis presented in this paper employs insights and tools from switching systems [4, 5, 6, 7, 8, 9, 10, 11], positive switching systems [12, 13, 14, 15], nonlinear systems [16], and linear systems [17]. We emphasize that our goal in this paper is to offer additional insights and analysis templates for value iteration via its connections to switching systems, rather than to improve existing convergence rates. We anticipate that the proposed switching system perspectives may serve as a useful tool, applicable in various settings, and stimulate the synergy between dynamic system theory and dynamic programming. Thus, further development of these methods could be a valuable avenue for future research. Finally, some ideas in this paper have been inspired by [11, 18], which develop switching system models of Q-learning [19]. However, most technical results and derivation processes are nontrivial and entirely different from those in [11, 18]. Another related work is [20], which provides a linear programming-based analysis of value iteration and also employs positive switching system models. However, that work merely proposed a numerical linear programming-based analysis to verify convergence properties and did not obtain the explicit relations given in (1), (2), and (3). Therefore, our analysis offers significantly different results from those in [20].

II Preliminaries

II-A Notations

The adopted notation is as follows: \mathbb{R}: set of real numbers; \mathbb{R}^{n}: n-dimensional Euclidean space; \mathbb{R}^{n\times m}: set of all n\times m real matrices; A^{T}: transpose of matrix A; A\succ 0 (A\prec 0, A\succeq 0, and A\preceq 0, respectively): symmetric positive definite (negative definite, positive semi-definite, and negative semi-definite, respectively) matrix A; I: identity matrix with appropriate dimensions; |{\cal S}|: cardinality of a finite set {\cal S}; A\otimes B: Kronecker product of matrices A and B; {\bf 1}: the vector with ones in all elements; \lambda_{\max}(\cdot): maximum eigenvalue; \lambda_{\min}(\cdot): minimum eigenvalue.

II-B Markov decision problem

We consider the infinite-horizon discounted Markov decision problem (MDP) [3], where the agent sequentially takes actions to maximize cumulative discounted rewards. In a Markov decision process with state-space {\cal S}:=\{1,2,\ldots,|{\cal S}|\} and action-space {\cal A}:=\{1,2,\ldots,|{\cal A}|\}, the decision maker selects an action a\in{\cal A} at the current state s. The state then transitions to a state s^{\prime} with probability P(s^{\prime}|s,a), and the transition incurs a reward r(s,a,s^{\prime}), where r is a reward function. For convenience, we consider a deterministic reward function and simply write r(s_{k},a_{k},s_{k+1})=:r_{k}, k\geq 0. A deterministic policy, \pi:{\cal S}\to{\cal A}, maps a state s\in{\cal S} to an action \pi(s)\in{\cal A}. The objective of the MDP is to find an optimal (deterministic) policy, \pi^{*}, that maximizes the cumulative discounted rewards over the infinite time horizon, i.e.,

\pi^{*}:=\operatorname{arg\,max}_{\pi\in\Theta}{\mathbb{E}}\left[\left.\sum_{k=0}^{\infty}\gamma^{k}r_{k}\right|\pi\right],

where \gamma\in[0,1) is the discount factor, \Theta is the set of all admissible deterministic policies, (s_{0},a_{0},s_{1},a_{1},\ldots) is a state-action trajectory generated by the Markov chain under policy \pi, and {\mathbb{E}}[\cdot|\pi] represents the expectation conditioned on the policy \pi. The Q-function under policy \pi is defined as

Q^{\pi}(s,a)={\mathbb{E}}\left[\left.\sum_{k=0}^{\infty}\gamma^{k}r_{k}\right|s_{0}=s,a_{0}=a,\pi\right]

for all s\in{\cal S}, a\in{\cal A}, and the optimal Q-function is defined as Q^{*}(s,a)=Q^{\pi^{*}}(s,a) for all s\in{\cal S}, a\in{\cal A}. Once Q^{*} is known, an optimal policy can be retrieved by \pi^{*}(s)=\operatorname{arg\,max}_{a\in{\cal A}}Q^{*}(s,a). Throughout, we make the following standard assumption.

Assumption 1

The reward function is unit bounded as follows:

\max_{(s,a,s^{\prime})\in{\cal S}\times{\cal A}\times{\cal S}}|r(s,a,s^{\prime})|\leq 1.

The unit bound imposed on r is only for simplicity of the analysis. Under Assumption 1, the Q-function is bounded as follows.

Lemma 1

We have \|Q^{*}\|_{\infty}\leq\frac{1}{1-\gamma}.

Proof:

The proof is straightforward from the definition of the Q-function: |Q^{*}(s,a)|\leq\sum_{k=0}^{\infty}\gamma^{k}=\frac{1}{1-\gamma}. ∎
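To make the setup concrete, the following is a minimal numerical sketch (not part of the original development) that builds a small random MDP satisfying Assumption 1, computes the Q-function of a fixed deterministic policy directly from its definition, and checks the geometric-series bound underlying Lemma 1. The MDP instance and all variable names are illustrative assumptions.

```python
import numpy as np

# A small random MDP (illustrative; not from the paper).
rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

# P[a] is the |S| x |S| transition matrix under action a; rewards lie in [-1, 1]
# so that Assumption 1 (unit-bounded rewards) holds.
P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)             # make each row a probability distribution
R = rng.uniform(-1.0, 1.0, size=(nS, nA))      # R(s, a) = E[r_k | s_k = s, a_k = a]

# Q-function of a fixed deterministic policy pi, obtained from the definition via
# the Bellman equation (I - gamma * P * Pi^pi) Q^pi = R.
pi = rng.integers(nA, size=nS)                 # an arbitrary deterministic policy

# Vectorize Q by stacking Q(., a) blocks, so index(s, a) = a * nS + s.
P_stack = np.vstack([P[a] for a in range(nA)])            # |S||A| x |S|
R_vec = np.concatenate([R[:, a] for a in range(nA)])      # |S||A|
Pi_pi = np.zeros((nS, nS * nA))
for s in range(nS):
    Pi_pi[s, pi[s] * nS + s] = 1.0                         # rows pi(s)^T kron e_s^T

Q_pi = np.linalg.solve(np.eye(nS * nA) - gamma * P_stack @ Pi_pi, R_vec)

# Lemma 1-style bound (same geometric-series argument, applied to Q^pi):
# |Q^pi(s,a)| <= sum_k gamma^k = 1 / (1 - gamma).
print(np.max(np.abs(Q_pi)), 1.0 / (1.0 - gamma))
assert np.max(np.abs(Q_pi)) <= 1.0 / (1.0 - gamma) + 1e-10
```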

II-C Switching system

Let us consider the switched linear system (SLS) [4, 5, 6, 7, 8, 9, 10, 11],

x_{k+1}=A_{\sigma_{k}}x_{k},\quad x_{0}=z\in{\mathbb{R}}^{n},\quad k\in\{0,1,\ldots\}, \qquad (4)

where x_{k}\in\mathbb{R}^{n} is the state, \sigma\in\mathcal{M}:=\{1,2,\ldots,M\} is called the mode, \sigma_{k}\in\mathcal{M} is called the switching signal, and A_{\sigma}, \sigma\in\mathcal{M}, are called the subsystem matrices. The switching signal can be either arbitrary or controlled by the user under a certain switching policy. Specifically, a state-feedback switching policy is denoted by \sigma_{k}=\sigma(x_{k}). The analysis and control synthesis of SLSs have been actively studied during the last decades [4, 5, 6, 7, 8, 9, 10, 11]. A more general class of systems is the switched affine system (SAS)

x_{k+1}=A_{\sigma_{k}}x_{k}+b_{\sigma_{k}},\quad x_{0}=z\in{\mathbb{R}}^{n},\quad k\in\{0,1,\ldots\},

where b_{\sigma_{k}}\in{\mathbb{R}}^{n} is an additional input vector, which also switches according to \sigma_{k}. Due to the additional input b_{\sigma_{k}}, its stabilization becomes much more challenging. Lastly, when all elements of the subsystem matrices \{A_{\sigma},\sigma\in{\mathcal{M}}\} are nonnegative, the SLS is called a positive SLS [12, 13, 14, 15].
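For illustration only, the following is a minimal simulation sketch of an SLS under a state-feedback switching policy; the two subsystem matrices and the min-switching rule are illustrative assumptions, not examples taken from the paper.

```python
import numpy as np

# A toy switched linear system x_{k+1} = A_{sigma_k} x_k with two modes
# (illustrative matrices; both have infinity norm below 1).
A = [np.array([[0.5, 0.4], [0.0, 0.6]]),
     np.array([[0.6, 0.0], [0.3, 0.5]])]

def sigma(x):
    # A simple state-feedback switching policy sigma_k = sigma(x_k):
    # pick the mode giving the smallest next-state norm (min-switching).
    return int(np.argmin([np.linalg.norm(Ai @ x) for Ai in A]))

x = np.array([1.0, -2.0])
for k in range(20):
    x = A[sigma(x)] @ x
print(np.linalg.norm(x))   # the trajectory contracts for these subsystem matrices
```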

II-D Definitions

Throughout the paper, we will use the following compact notations:

P:=\begin{bmatrix}P_{1}\\ \vdots\\ P_{|{\cal A}|}\end{bmatrix},\quad R:=\begin{bmatrix}R(\cdot,1)\\ \vdots\\ R(\cdot,|{\cal A}|)\end{bmatrix},\quad Q:=\begin{bmatrix}Q(\cdot,1)\\ \vdots\\ Q(\cdot,|{\cal A}|)\end{bmatrix},

where P_{a}=P(\cdot|\cdot,a)\in{\mathbb{R}}^{|{\cal S}|\times|{\cal S}|}, Q(\cdot,a)\in{\mathbb{R}}^{|{\cal S}|}, a\in{\cal A}, and R\in{\mathbb{R}}^{|{\cal S}||{\cal A}|} is an enumeration of R(s,a):={\mathbb{E}}[r_{k}|s_{k}=s,a_{k}=a] in an order compatible with the other definitions. Note that P\in{\mathbb{R}}^{|{\cal S}||{\cal A}|\times|{\cal S}|} and Q\in{\mathbb{R}}^{|{\cal S}||{\cal A}|}. In this notation, the Q-function is encoded as a single vector Q\in{\mathbb{R}}^{|{\cal S}||{\cal A}|}, which enumerates Q(s,a) for all s\in{\cal S} and a\in{\cal A} in an appropriate order. In particular, the single value Q(s,a) can be written as Q(s,a)=(e_{a}\otimes e_{s})^{T}Q, where e_{s}\in{\mathbb{R}}^{|{\cal S}|} and e_{a}\in{\mathbb{R}}^{|{\cal A}|} are the s-th basis vector (all components are 0 except for the s-th component, which is 1) and the a-th basis vector, respectively. Therefore, in the above definitions, all entries are ordered compatibly with this vector Q.

For any stochastic policy, \pi:{\cal S}\to\Delta_{|{\cal A}|}, where \Delta_{|{\cal A}|} is the set of all probability distributions over {\cal A}, we define the corresponding action transition matrix as

\Pi^{\pi}:=\begin{bmatrix}\pi(1)^{T}\otimes e_{1}^{T}\\ \pi(2)^{T}\otimes e_{2}^{T}\\ \vdots\\ \pi(|{\cal S}|)^{T}\otimes e_{|{\cal S}|}^{T}\end{bmatrix}\in{\mathbb{R}}^{|{\cal S}|\times|{\cal S}||{\cal A}|}, \qquad (5)

where e_{s}\in\mathbb{R}^{|{\cal S}|}. Then, it is well known that P\Pi^{\pi}\in\mathbb{R}^{|{\cal S}||{\cal A}|\times|{\cal S}||{\cal A}|} is the transition probability matrix of the state-action pair under policy \pi. If we consider a deterministic policy, \pi:{\cal S}\to{\cal A}, the stochastic policy can be replaced with the corresponding one-hot encoding vector \vec{\pi}(s):=e_{\pi(s)}\in\Delta_{|{\cal A}|}, where e_{a}\in\mathbb{R}^{|{\cal A}|}, and the corresponding action transition matrix is identical to (5) with \pi replaced by \vec{\pi}. For any given Q\in\mathbb{R}^{|{\cal S}||{\cal A}|}, denote the greedy policy with respect to Q by \pi_{Q}(s):=\operatorname{arg\,max}_{a\in{\cal A}}Q(s,a)\in{\cal A}. We will frequently use the shorthand \Pi_{Q}:=\Pi^{\pi_{Q}}.
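The compact notation above can be reproduced numerically. The following sketch (illustrative, with assumed variable names) builds the stacked matrix P, a Q vector, and the action transition matrix \Pi^{\pi} from (5), and checks that Q(s,a)=(e_{a}\otimes e_{s})^{T}Q and that P\Pi^{\pi} is a row-stochastic transition matrix over state-action pairs.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA = 3, 2

# Random MDP data in the stacked notation of Section II-D (illustrative example).
P = np.vstack([rng.dirichlet(np.ones(nS), size=nS) for _ in range(nA)])  # |S||A| x |S|
Q = rng.standard_normal(nS * nA)          # Q vector, index(s, a) = a*nS + s

def e(i, n):
    v = np.zeros(n); v[i] = 1.0
    return v

# Q(s, a) = (e_a kron e_s)^T Q
s, a = 1, 0
assert np.isclose(np.kron(e(a, nA), e(s, nS)) @ Q, Q[a * nS + s])

# Action transition matrix Pi^pi for a stochastic policy pi: S -> Delta_{|A|},
# built row by row as pi(s)^T kron e_s^T, as in (5).
pi = rng.dirichlet(np.ones(nA), size=nS)  # pi[s] is a distribution over actions
Pi = np.vstack([np.kron(pi[s], e(s, nS)) for s in range(nS)])   # |S| x |S||A|

# P @ Pi is the |S||A| x |S||A| transition matrix of the state-action pair:
# every row sums to one.
T = P @ Pi
assert np.allclose(T.sum(axis=1), 1.0)
print(T.shape)
```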

II-E Q-value iteration (Q-VI)

In this paper, we consider the so-called Q-value iteration (Q-VI) [1] given in Algorithm 1, where F is called the Bellman operator. It is well known that the iterates of Q-VI converge exponentially to Q^{*} in terms of the infinity norm \left\|\cdot\right\|_{\infty} [1, Lemma 2.5].

Lemma 2

We have the following bound for the Q-VI iterates: \left\|Q_{k+1}-Q^{*}\right\|_{\infty}\leq\gamma\left\|Q_{k}-Q^{*}\right\|_{\infty}.

The proof is given in [1, Lemma 2.5] and is based on the contraction property of the Bellman operator. Note that [1, Lemma 2.5] deals with the value iteration for the value function instead of the Q-function addressed in our work; however, the two are equivalent. Next, a direct consequence of Lemma 2 is the convergence of Q-VI:

\left\|Q_{k}-Q^{*}\right\|_{\infty}\leq\gamma^{k}\left\|Q_{0}-Q^{*}\right\|_{\infty}. \qquad (6)
Algorithm 1 Q-VI
1: Initialize Q_{0}\in{\mathbb{R}}^{|{\cal S}||{\cal A}|} randomly.
2: for iteration k=0,1,\ldots do
3:     Update
Q_{k+1}(s,a)=\underbrace{R(s,a)+\gamma\sum_{s^{\prime}\in{\cal S}}P(s^{\prime}|s,a)\max_{a^{\prime}\in{\cal A}}Q_{k}(s^{\prime},a^{\prime})}_{=:FQ_{k}}
4: end for
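For reference, a minimal Python sketch of Q-VI on a small random MDP is given below; it numerically checks the contraction property of Lemma 2 along the iterates. The MDP instance and the function names are illustrative assumptions, and Q^{*} is approximated by running the iteration to high accuracy.

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma = 5, 3, 0.9

# Random MDP with unit-bounded rewards (Assumption 1); illustrative instance.
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.uniform(-1.0, 1.0, size=(nS, nA))

def bellman(Q):
    # (FQ)(s, a) = R(s, a) + gamma * sum_{s'} P(s'|s, a) * max_{a'} Q(s', a'),
    # with Q stored as an |S| x |A| array for readability.
    V = Q.max(axis=1)
    return R + gamma * np.stack([P[a] @ V for a in range(nA)], axis=1)

# Approximate Q* by running Q-VI to high accuracy (F is a gamma-contraction).
Q_star = np.zeros((nS, nA))
for _ in range(2000):
    Q_star = bellman(Q_star)

# Check the contraction property (1): ||Q_{k+1} - Q*||_inf <= gamma * ||Q_k - Q*||_inf.
Q = rng.uniform(-10, 10, size=(nS, nA))
for k in range(30):
    Q_next = bellman(Q)
    lhs = np.abs(Q_next - Q_star).max()
    rhs = gamma * np.abs(Q - Q_star).max()
    assert lhs <= rhs + 1e-12
    Q = Q_next
print("contraction (1) verified numerically")
```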

In what follows, an equivalent switching system model that captures the behavior of Q-VI is introduced, and based on it, we provide a different proof of Lemma 2.

III Switching system model

In this section, we study a discrete-time switching system model of Q-VI and establish its finite-time convergence based on the stability analysis of the switching system. Using the notation introduced in Section II-D, the update in Algorithm 1 can be rewritten as

Q_{k+1}=R+\gamma P\Pi_{Q_{k}}Q_{k}. \qquad (7)

Recall the definitions of \pi_{Q}(s) and \Pi_{Q}. Invoking the optimal Bellman equation (\gamma P\Pi_{Q^{*}}-I)Q^{*}+R=0, (7) can be further rewritten as

Q_{k+1}-Q^{*}=\gamma P\Pi_{Q_{k}}(Q_{k}-Q^{*})+\gamma P(\Pi_{Q_{k}}-\Pi_{Q^{*}})Q^{*}, \qquad (8)

which is an SAS (switched affine system). In particular, for any Q\in{\mathbb{R}}^{|{\cal S}||{\cal A}|}, define

A_{Q}:=\gamma P\Pi_{Q},\quad b_{Q}:=\gamma P(\Pi_{Q}-\Pi_{Q^{*}})Q^{*}.

Hence, Q-VI can be concisely represented as the SAS

Q_{k+1}-Q^{*}=A_{Q_{k}}(Q_{k}-Q^{*})+b_{Q_{k}}, \qquad (9)

where A_{Q_{k}} and b_{Q_{k}} switch among the matrices \{\gamma P\Pi^{\pi}:\pi\in\Theta\} and the vectors \{\gamma P(\Pi^{\pi}-\Pi^{\pi^{*}})Q^{*}:\pi\in\Theta\}, respectively, according to the changes of Q_{k}. Therefore, the convergence of Q-VI is now reduced to analyzing the stability of the above SAS. The main obstacle in proving the stability arises from the presence of the affine term. Without it, we can easily establish the exponential stability of the corresponding deterministic switching system under arbitrary switching policies. Specifically, we have the following result.

Proposition 1

For arbitrary H_{k}\in{\mathbb{R}}^{|{\cal S}||{\cal A}|}, k\geq 0, the linear switching system Q_{k+1}-Q^{*}=A_{H_{k}}(Q_{k}-Q^{*}), Q_{0}-Q^{*}\in{\mathbb{R}}^{|{\cal S}||{\cal A}|}, is exponentially stable in the sense that \|Q_{k+1}-Q^{*}\|_{\infty}\leq\gamma\|Q_{k}-Q^{*}\|_{\infty} and \|Q_{k}-Q^{*}\|_{\infty}\leq\gamma^{k}\|Q_{0}-Q^{*}\|_{\infty} for all k\geq 0.

The above result follows immediately from the key fact that \|A_{Q}\|_{\infty}\leq\gamma, which we formally state in the lemma below.

Lemma 3

For any Q\in{\mathbb{R}}^{|{\cal S}||{\cal A}|}, \|A_{Q}\|_{\infty}\leq\gamma, where the matrix norm is \|A\|_{\infty}:=\max_{1\leq i\leq m}\sum_{j=1}^{n}|A_{ij}| and A_{ij} is the element of A in the i-th row and j-th column.

Proof:

Note that \sum_{j}|[A_{Q}]_{ij}|=\sum_{j}[\gamma P\Pi_{Q}]_{ij}=\gamma, which completes the proof. ∎
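The SAS representation (9) and Lemma 3 can also be checked numerically. The sketch below (illustrative, assuming a random MDP and an approximated Q^{*}) forms A_{Q}=\gamma P\Pi_{Q} and b_{Q} in the stacked notation and verifies that (9) reproduces the Q-VI update and that the row sums of A_{Q} equal \gamma.

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma = 4, 3, 0.9

P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.uniform(-1.0, 1.0, size=(nS, nA))

# Stacked notation of Section II-D: index(s, a) = a*nS + s.
P_stack = np.vstack([P[a] for a in range(nA)])
R_vec = np.concatenate([R[:, a] for a in range(nA)])

def Pi_greedy(Q):
    # Pi_Q for the greedy policy pi_Q(s) = argmax_a Q(s, a), as a |S| x |S||A| matrix.
    Qsa = Q.reshape(nA, nS).T
    Pi = np.zeros((nS, nS * nA))
    for s in range(nS):
        Pi[s, np.argmax(Qsa[s]) * nS + s] = 1.0
    return Pi

def bellman(Q):
    return R_vec + gamma * P_stack @ Pi_greedy(Q) @ Q

# Approximate Q* by iterating the Bellman operator.
Q_star = np.zeros(nS * nA)
for _ in range(2000):
    Q_star = bellman(Q_star)

# Lemma 3: the induced infinity norm of A_Q = gamma * P * Pi_Q equals gamma.
Q = rng.standard_normal(nS * nA)
A_Q = gamma * P_stack @ Pi_greedy(Q)
assert np.isclose(np.abs(A_Q).sum(axis=1).max(), gamma)

# Switched affine system (9): Q_{k+1} - Q* = A_{Qk}(Q_k - Q*) + b_{Qk}.
b_Q = gamma * P_stack @ (Pi_greedy(Q) - Pi_greedy(Q_star)) @ Q_star
lhs = bellman(Q) - Q_star
rhs = A_Q @ (Q - Q_star) + b_Q
assert np.allclose(lhs, rhs)
print("SAS representation (9) verified")
```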

However, because of the additional affine term in the original switching system (9), it is not obvious how to directly derive its finite-time convergence. To circumvent the difficulty with the affine term, we will resort to two simpler upper and lower bounds, which are given below.

Proposition 2 (Upper and lower bounds)

For all k\geq 0, we have

A_{Q^{*}}(Q_{k}-Q^{*})\leq Q_{k+1}-Q^{*}\leq A_{Q_{k}}(Q_{k}-Q^{*}).
Proof:

For the lower bound, we have

Q_{k+1}-Q^{*}
= A_{Q^{*}}(Q_{k}-Q^{*})+(A_{Q_{k}}-A_{Q^{*}})(Q_{k}-Q^{*})+b_{Q_{k}}
= A_{Q^{*}}(Q_{k}-Q^{*})+\gamma P(\Pi_{Q_{k}}-\Pi_{Q^{*}})Q_{k}
\geq A_{Q^{*}}(Q_{k}-Q^{*}),

where the third line is due to P(\Pi_{Q_{k}}-\Pi_{Q^{*}})Q_{k}\geq P(\Pi_{Q^{*}}-\Pi_{Q^{*}})Q_{k}=0, which holds because the greedy policy satisfies \Pi_{Q_{k}}Q_{k}\geq\Pi^{\pi}Q_{k} for any policy \pi. For the upper bound, one gets

Q_{k+1}-Q^{*} = A_{Q_{k}}(Q_{k}-Q^{*})+b_{Q_{k}} \leq A_{Q_{k}}(Q_{k}-Q^{*}),

where the inequality uses the fact that b_{Q_{k}}=\gamma P\Pi_{Q_{k}}Q^{*}-\gamma P\Pi_{Q^{*}}Q^{*}\leq\gamma P\Pi_{Q^{*}}Q^{*}-\gamma P\Pi_{Q^{*}}Q^{*}=0, which holds because \Pi_{Q^{*}} is greedy with respect to Q^{*} and hence \Pi_{Q_{k}}Q^{*}\leq\Pi_{Q^{*}}Q^{*}. This completes the proof. ∎

The lower bound is in the form of a linear time-invariant (LTI) system. Specifically, it corresponds to a special class of LTI systems known as positive linear systems [21], where all elements of the system matrix A_{Q^{*}} are nonnegative. In our analysis, this property will be utilized to derive new behaviors of Q-VI. Similarly, the system matrix A_{Q_{k}} in the upper bound switches according to the changes in Q_{k}; hence, the upper bound is in the form of a positive SLS [12, 13, 14, 15], where all elements of the system matrix for each mode are nonnegative.
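The sandwich bounds of Proposition 2 can be verified numerically along a Q-VI trajectory, as in the following illustrative sketch (random MDP, approximated Q^{*}, assumed names).

```python
import numpy as np

rng = np.random.default_rng(4)
nS, nA, gamma = 4, 3, 0.9

P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.uniform(-1.0, 1.0, size=(nS, nA))
P_stack = np.vstack([P[a] for a in range(nA)])
R_vec = np.concatenate([R[:, a] for a in range(nA)])

def Pi_greedy(Q):
    Qsa = Q.reshape(nA, nS).T
    Pi = np.zeros((nS, nS * nA))
    for s in range(nS):
        Pi[s, np.argmax(Qsa[s]) * nS + s] = 1.0
    return Pi

def bellman(Q):
    return R_vec + gamma * P_stack @ Pi_greedy(Q) @ Q

Q_star = np.zeros(nS * nA)
for _ in range(2000):
    Q_star = bellman(Q_star)
A_star = gamma * P_stack @ Pi_greedy(Q_star)      # A_{Q*} of the lower bound

# Proposition 2: A_{Q*}(Q_k - Q*) <= Q_{k+1} - Q* <= A_{Q_k}(Q_k - Q*), elementwise.
Q = rng.uniform(-5, 5, size=nS * nA)
for k in range(30):
    Q_next = bellman(Q)
    A_Qk = gamma * P_stack @ Pi_greedy(Q)
    assert np.all(A_star @ (Q - Q_star) <= Q_next - Q_star + 1e-10)
    assert np.all(Q_next - Q_star <= A_Qk @ (Q - Q_star) + 1e-10)
    Q = Q_next
print("Proposition 2 bounds hold along the trajectory")
```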

III-A Finite-time error bound of Q-VI

In this subsection, we provide a proof for the contraction property outlined in Lemma 2. Although this result represents one of the most basic and fundamental outcomes in classical dynamic programming, we offer an alternative proof in this paper. This proof, grounded in the switching system model discussed in earlier sections, is provided below.

Since A_{Q^{*}}(Q_{k}-Q^{*})\leq Q_{k+1}-Q^{*}\leq A_{Q_{k}}(Q_{k}-Q^{*}) from Proposition 2, it follows that (e_{a}\otimes e_{s})^{T}A_{Q^{*}}(Q_{k}-Q^{*})\leq(e_{a}\otimes e_{s})^{T}(Q_{k+1}-Q^{*})\leq(e_{a}\otimes e_{s})^{T}A_{Q_{k}}(Q_{k}-Q^{*}), where e_{s}\in{\mathbb{R}}^{|{\cal S}|} and e_{a}\in{\mathbb{R}}^{|{\cal A}|} are the s-th and a-th standard basis vectors, respectively. If (e_{a}\otimes e_{s})^{T}(Q_{k+1}-Q^{*})\leq 0, then |(e_{a}\otimes e_{s})^{T}(Q_{k+1}-Q^{*})|\leq|(e_{a}\otimes e_{s})^{T}A_{Q^{*}}(Q_{k}-Q^{*})|. If (e_{a}\otimes e_{s})^{T}(Q_{k+1}-Q^{*})>0, then |(e_{a}\otimes e_{s})^{T}(Q_{k+1}-Q^{*})|\leq|(e_{a}\otimes e_{s})^{T}A_{Q_{k}}(Q_{k}-Q^{*})|. Therefore, one gets

\left\|Q_{k+1}-Q^{*}\right\|_{\infty}
\leq \max\{\left\|A_{Q_{k}}(Q_{k}-Q^{*})\right\|_{\infty},\left\|A_{Q^{*}}(Q_{k}-Q^{*})\right\|_{\infty}\}
\leq \max\{\gamma\left\|Q_{k}-Q^{*}\right\|_{\infty},\gamma\left\|Q_{k}-Q^{*}\right\|_{\infty}\}
= \gamma\left\|Q_{k}-Q^{*}\right\|_{\infty},

which completes the proof.

IV Convergence of Q-VI on the orthant

In the previous section, we revisited the convergence of Q-VI from the switching system perspective. In this section, we study additional geometric behaviors of Q-VI, again using this perspective. Let us suppose that Q^{*}\geq Q_{0} holds. Note that such an initial value can be obtained by setting

Q_{0}=-\frac{1}{1-\gamma}{\bf 1},

because

-\frac{1}{1-\gamma}{\bf 1}\leq Q^{*}\leq\frac{1}{1-\gamma}{\bf 1}

holds according to Lemma 1. Then, using the bounds provided in Proposition 2, it can be proved that Q^{*}\geq Q_{k} for all k\geq 0, which is equivalent to saying 0\geq Q_{k}-Q^{*} for all k\geq 0. In other words, if the initial iterate Q_{0} falls within the shifted orthant Q^{*}\geq Q_{0}, then all future iterates stay within the same set, i.e., Q^{*}\geq Q_{k} for all k\geq 0.

Proposition 3

Suppose that Q_{0}-Q^{*}\leq 0 holds. Then, Q_{k}-Q^{*}\leq 0 for all k\geq 0.

Proof:

For an induction argument, suppose Q_{k}-Q^{*}\leq 0 for some k\geq 0. Then, by Proposition 2, it follows that Q_{k+1}-Q^{*}\leq A_{Q_{k}}(Q_{k}-Q^{*})\leq 0 because A_{Q_{k}} is a nonnegative matrix. Therefore, Q_{k+1}-Q^{*}\leq 0, and the proof is completed by induction. ∎
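A quick numerical check of Proposition 3 on an illustrative random MDP: starting from Q_{0}=-\frac{1}{1-\gamma}{\bf 1}, the iterates remain below Q^{*} (here Q^{*} is approximated by running Q-VI to high accuracy; all names are assumptions of this sketch).

```python
import numpy as np

rng = np.random.default_rng(5)
nS, nA, gamma = 4, 3, 0.9

P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.uniform(-1.0, 1.0, size=(nS, nA))

def bellman(Q):
    V = Q.max(axis=1)
    return R + gamma * np.stack([P[a] @ V for a in range(nA)], axis=1)

Q_star = np.zeros((nS, nA))
for _ in range(2000):
    Q_star = bellman(Q_star)

# Proposition 3: if Q_0 <= Q*, every iterate stays in the shifted orthant Q_k <= Q*.
# The initialization Q_0 = -1/(1-gamma) * 1 guarantees Q_0 <= Q* by Lemma 1.
Q = -np.ones((nS, nA)) / (1.0 - gamma)
for k in range(50):
    assert np.all(Q <= Q_star + 1e-10)
    Q = bellman(Q)
print("orthant invariance verified")
```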

From the results above, it can be observed that the behavior of Q_{k}-Q^{*} is fully dictated by the lower bound, which is a linear function of Q_{k}-Q^{*}. Therefore, the fundamental theory of linear systems can be applied to analyze the behavior of Q-VI. To facilitate this, the subsequent result reviews Lyapunov theory for discrete-time linear systems.

Proposition 4

For any \varepsilon>0 such that \gamma+\varepsilon\in(0,1), there exists a corresponding positive definite matrix M\succ 0 such that

A_{Q^{*}}^{T}MA_{Q^{*}}=(\gamma+\varepsilon)^{2}(M-I),

and

\lambda_{\min}(M)\geq 1,\qquad\lambda_{\max}(M)\leq\frac{|{\cal S}||{\cal A}|}{1-\left(\frac{\gamma}{\gamma+\varepsilon}\right)^{2}}.

The proof of Proposition 4 is provided in the Appendix. Note that Proposition 4 applies to general LTI systems. However, because the lower bound takes the form of a positive linear system, it can be shown that the Lyapunov matrix M also possesses the additional special property outlined below.

Proposition 5

M is a nonnegative matrix.

Proof:

The proof follows directly from the construction of M in (12), given in the Appendix. ∎
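Propositions 4 and 5 can be checked numerically by truncating the series (12) that defines M (see the Appendix). The sketch below is illustrative; the MDP instance and the choice \varepsilon=0.05 are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
nS, nA, gamma, eps = 4, 3, 0.9, 0.05   # gamma + eps must lie in (0, 1)

P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.uniform(-1.0, 1.0, size=(nS, nA))
P_stack = np.vstack([P[a] for a in range(nA)])
R_vec = np.concatenate([R[:, a] for a in range(nA)])

def Pi_greedy(Q):
    Qsa = Q.reshape(nA, nS).T
    Pi = np.zeros((nS, nS * nA))
    for s in range(nS):
        Pi[s, np.argmax(Qsa[s]) * nS + s] = 1.0
    return Pi

def bellman(Q):
    return R_vec + gamma * P_stack @ Pi_greedy(Q) @ Q

Q_star = np.zeros(nS * nA)
for _ in range(2000):
    Q_star = bellman(Q_star)
A = gamma * P_stack @ Pi_greedy(Q_star)          # A_{Q*}

# M from the (truncated) series (12); it converges since rho(A) <= gamma < gamma + eps.
n = nS * nA
M, term = np.zeros((n, n)), np.eye(n)
for _ in range(3000):
    M += term
    term = (A.T @ term @ A) / (gamma + eps) ** 2

# Proposition 4: A^T M A = (gamma + eps)^2 (M - I) and the eigenvalue bounds.
assert np.allclose(A.T @ M @ A, (gamma + eps) ** 2 * (M - np.eye(n)))
evals = np.linalg.eigvalsh((M + M.T) / 2)
assert evals.min() >= 1 - 1e-8
assert evals.max() <= n / (1 - (gamma / (gamma + eps)) ** 2) + 1e-8
# Proposition 5: M is (entrywise) nonnegative.
assert np.all(M >= -1e-12)
print("Lyapunov matrix M verified")
```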

From the perspectives provided above, we can derive a bound on Q_{k}-Q^{*} in terms of the weighted Euclidean norm \left\|\cdot\right\|_{M}, as an alternative to the infinity norm \left\|\cdot\right\|_{\infty}. Such a bound cannot be derived from the classical contraction mapping arguments.

Theorem 1

Suppose that Q_{0}-Q^{*}\leq 0 holds. Then, for any k\geq 0,

\left\|Q_{k+1}-Q^{*}\right\|_{M}\leq(\gamma+\varepsilon)\left\|Q_{k}-Q^{*}\right\|_{M}-(\gamma+\varepsilon)\left\|Q_{k}-Q^{*}\right\|_{2}

holds.

Proof:

We have

(Q_{k+1}-Q^{*})^{T}M(Q_{k+1}-Q^{*})
\leq (Q_{k}-Q^{*})^{T}A_{Q^{*}}^{T}MA_{Q^{*}}(Q_{k}-Q^{*})
= (\gamma+\varepsilon)^{2}(Q_{k}-Q^{*})^{T}M(Q_{k}-Q^{*})-(\gamma+\varepsilon)^{2}(Q_{k}-Q^{*})^{T}(Q_{k}-Q^{*}),

where the first inequality follows from the facts that M is a nonnegative matrix, that A_{Q^{*}}(Q_{k}-Q^{*})\leq Q_{k+1}-Q^{*} by Proposition 2, and that Q_{k+1}-Q^{*}\leq 0 by Proposition 3, and the equality is due to the Lyapunov equation in Proposition 4. Taking the square root on both sides yields the desired conclusion. ∎

The result in Theorem 1 provides

\left\|Q_{k+1}-Q^{*}\right\|_{M}\leq(\gamma+\varepsilon)\left\|Q_{k}-Q^{*}\right\|_{M}-(\gamma+\varepsilon)\left\|Q_{k}-Q^{*}\right\|_{2}

for an arbitrarily small \varepsilon>0, which implies

\left\|Q_{k+1}-Q^{*}\right\|_{M}\leq(\gamma+\varepsilon)\left\|Q_{k}-Q^{*}\right\|_{M}

and

\left\|Q_{k}-Q^{*}\right\|_{M}\leq(\gamma+\varepsilon)^{k}\left\|Q_{0}-Q^{*}\right\|_{M}.

This relationship reveals different geometric convergence behaviors: the sublevel sets associated with the infinity norm \left\|\cdot\right\|_{\infty} are squares, whereas those corresponding to the weighted Euclidean norm M\left\|\cdot\right\|_{M} are ellipsoids.
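The weighted-norm contraction (2) on the shifted orthant can be verified numerically, as in the following illustrative sketch (random MDP, truncated series for M, assumed \varepsilon; Q^{*} is approximated by running Q-VI to high accuracy).

```python
import numpy as np

rng = np.random.default_rng(7)
nS, nA, gamma, eps = 4, 3, 0.9, 0.05

P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.uniform(-1.0, 1.0, size=(nS, nA))
P_stack = np.vstack([P[a] for a in range(nA)])
R_vec = np.concatenate([R[:, a] for a in range(nA)])

def Pi_greedy(Q):
    Qsa = Q.reshape(nA, nS).T
    Pi = np.zeros((nS, nS * nA))
    for s in range(nS):
        Pi[s, np.argmax(Qsa[s]) * nS + s] = 1.0
    return Pi

def bellman(Q):
    return R_vec + gamma * P_stack @ Pi_greedy(Q) @ Q

Q_star = np.zeros(nS * nA)
for _ in range(2000):
    Q_star = bellman(Q_star)
A = gamma * P_stack @ Pi_greedy(Q_star)

# Lyapunov matrix M from the truncated series (12).
n = nS * nA
M, term = np.zeros((n, n)), np.eye(n)
for _ in range(2000):
    M += term
    term = (A.T @ term @ A) / (gamma + eps) ** 2

def norm_M(x):
    return np.sqrt(max(x @ M @ x, 0.0))

# Weighted-norm contraction (2) on the shifted orthant: start from Q_0 <= Q*.
Q = -np.ones(n) / (1.0 - gamma)
for k in range(50):
    Q_next = bellman(Q)
    assert norm_M(Q_next - Q_star) <= (gamma + eps) * norm_M(Q - Q_star) + 1e-9
    Q = Q_next
print("weighted-norm contraction (2) verified on the orthant")
```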

Furthermore, our new approach enables us to derive a bound on a linear function of Q_{k}-Q^{*}. Specifically, for positive LTI systems, a common Lyapunov function is a linear Lyapunov function of the form V(x)=v^{T}x for a positive vector v\in{\mathbb{R}}^{|{\cal S}||{\cal A}|}, v\geq 0. We proceed by providing an explicit form of such a vector v.

Proposition 6

For any \varepsilon>0 such that \gamma+\varepsilon\in(0,1) and for any positive vector w\in{\mathbb{R}}^{|{\cal S}||{\cal A}|}, define

v:=\left(\sum_{i=0}^{\infty}\left(\frac{1}{\gamma+\varepsilon}\right)^{i}A_{Q^{*}}^{i}\right)^{T}w\in{\mathbb{R}}^{|{\cal S}||{\cal A}|}.

Then, it holds true that

v>0,\qquad\left\|w\right\|_{\infty}\leq\left\|v\right\|_{\infty}\leq\frac{\left\|w\right\|_{1}}{1-\gamma},

and

v^{T}A_{Q^{*}}=(\gamma+\varepsilon)(v^{T}-w^{T}). \qquad (10)

The proof is given in the Appendix. Proposition 6 shows that V(x)=v^{T}x plays the role of a linear Lyapunov function for the positive LTI system in the lower bound. From Proposition 6, one can prove another convergence result of Q-VI.

Theorem 2

Suppose that Q_{0}-Q^{*}\leq 0 holds. Then, for the positive vector v given in Proposition 6, we have

(\gamma+\varepsilon)v^{T}(Q_{k}-Q^{*})
\leq v^{T}(Q_{k+1}-Q^{*})+(\gamma+\varepsilon)w^{T}(Q_{k}-Q^{*})
\leq (\gamma+\varepsilon)w^{T}(Q_{k}-Q^{*}),

for all k\geq 0.

Proof:

Multiplying both sides of (10) in Proposition 6 by Q_{k}-Q^{*} from the right leads to

(\gamma+\varepsilon)v^{T}(Q_{k}-Q^{*})-(\gamma+\varepsilon)w^{T}(Q_{k}-Q^{*})
= v^{T}A_{Q^{*}}(Q_{k}-Q^{*})
\leq v^{T}(Q_{k+1}-Q^{*})
\leq 0,

where the first equality comes from (10), the first inequality is due to A_{Q^{*}}(Q_{k}-Q^{*})\leq Q_{k+1}-Q^{*} in Proposition 2 together with v\geq 0, and the second inequality follows from Q_{k+1}-Q^{*}\leq 0 in Proposition 3. This completes the proof. ∎

Since w is positive and Q_{k}-Q^{*}\leq 0 by Proposition 3, we have w^{T}(Q_{k}-Q^{*})\leq 0, and hence Theorem 2 tells us that

(\gamma+\varepsilon)v^{T}(Q_{k}-Q^{*})\leq v^{T}(Q_{k+1}-Q^{*})\leq 0,\quad\forall k\geq 0,

which implies

(\gamma+\varepsilon)^{k}v^{T}(Q_{0}-Q^{*})\leq v^{T}(Q_{k}-Q^{*})\leq 0,\quad\forall k\geq 0. \qquad (11)

Suppose that Q_{k}-Q^{*}=[v]_{k}+[v]_{k}^{\bot}, where [v]_{k} denotes the component of Q_{k}-Q^{*} in the direction of v, and [v]_{k}^{\bot} denotes the component of Q_{k}-Q^{*} orthogonal to v. Then, according to (11), the component of Q_{k}-Q^{*} in the direction of v diminishes exponentially. From the construction of v in Proposition 6, it can be seen that there are infinitely many such vectors v, depending on the choice of the positive vector w. Therefore, Q_{k}-Q^{*} is trapped in the intersection of the nonpositive orthant and infinitely many half-planes. An illustration with a single half-plane is provided in Figure 1, and a numerical sketch of this behavior is given after the figure.

Figure 1: Evolution of QkQQ_{k}-Q^{*} and geometric properties from Theorem 2.
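The construction of v in Proposition 6, the identity (10), and the linear decrease (3) can be checked numerically. In the illustrative sketch below, v is obtained in closed form as ((I-A_{Q^{*}}/(\gamma+\varepsilon))^{-1})^{T}w, which equals the series in Proposition 6; the choices w={\bf 1} and \varepsilon=0.05, and the random MDP, are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(8)
nS, nA, gamma, eps = 4, 3, 0.9, 0.05

P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.uniform(-1.0, 1.0, size=(nS, nA))
P_stack = np.vstack([P[a] for a in range(nA)])
R_vec = np.concatenate([R[:, a] for a in range(nA)])

def Pi_greedy(Q):
    Qsa = Q.reshape(nA, nS).T
    Pi = np.zeros((nS, nS * nA))
    for s in range(nS):
        Pi[s, np.argmax(Qsa[s]) * nS + s] = 1.0
    return Pi

def bellman(Q):
    return R_vec + gamma * P_stack @ Pi_greedy(Q) @ Q

Q_star = np.zeros(nS * nA)
for _ in range(2000):
    Q_star = bellman(Q_star)
A = gamma * P_stack @ Pi_greedy(Q_star)
n = nS * nA

# Linear Lyapunov vector v of Proposition 6, via the closed form of the Neumann series:
# v = (sum_i (A/(gamma+eps))^i)^T w = ((I - A/(gamma+eps))^{-1})^T w for a positive w.
w = np.ones(n)
v = np.linalg.solve(np.eye(n) - A.T / (gamma + eps), w)
assert np.all(v > 0)
# Identity (10): v^T A_{Q*} = (gamma + eps)(v^T - w^T).
assert np.allclose(v @ A, (gamma + eps) * (v - w))

# Inequality (3): along Q-VI started in the orthant Q_0 <= Q*,
# (gamma+eps) v^T(Q_k - Q*) <= v^T(Q_{k+1} - Q*) <= 0.
Q = -np.ones(n) / (1.0 - gamma)
for k in range(50):
    Q_next = bellman(Q)
    assert (gamma + eps) * v @ (Q - Q_star) <= v @ (Q_next - Q_star) + 1e-9
    assert v @ (Q_next - Q_star) <= 1e-9
    Q = Q_next
print("linear Lyapunov decrease (3) verified")
```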

V Conclusion

In this paper, we have presented additional insights on value iteration, approached through the lens of switching system models in the control community. This offers a connection between value iteration and switching system theory and reveals additional geometric behaviors of value iteration. Specifically, we introduced a switching system model of value iteration and, based on it, provided a novel proof for the contraction property of the value iteration. Moreover, our insights led to the proof of new geometric behaviors of value iteration when the initial iterate resides in a particular set (the shifted orthant). We believe that the perspectives proposed here could serve as useful tools applicable in various settings. As such, further development of these methods may present a valuable future direction.

References

  • [1] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-dynamic programming.   Athena Scientific Belmont, MA, 1996.
  • [2] D. P. Bertsekas, “Dynamic programming and optimal control 4th edition, volume ii,” Athena Scientific, 2015.
  • [3] M. L. Puterman, Markov decision processes: Discrete stochastic dynamic programming.   John Wiley & Sons, 2014.
  • [4] D. Liberzon, Switching in systems and control.   Springer Science & Business Media, 2003.
  • [5] D. Lee and J. Hu, “Periodic stabilization of discrete-time switched linear systems,” IEEE Transactions on Automatic Control, vol. 62, no. 7, pp. 3382–3394, 2017.
  • [6] ——, “Stabilizability of discrete-time controlled switched linear systems,” IEEE Transactions on Automatic Control, vol. 63, no. 10, pp. 3516–3522, 2018.
  • [7] J. C. Geromel and P. Colaneri, “Stability and stabilization of discrete time switched systems,” Int. J. Control, vol. 79, no. 7, pp. 719–728, 2006.
  • [8] T. Hu, L. Ma, and Z. Lin, “Stabilization of switched systems via composite quadratic functions,” IEEE Trans. Autom. Control, vol. 53, no. 11, pp. 2571–2585, 2008.
  • [9] H. Lin and P. J. Antsaklis, “Stability and stabilizability of switched linear systems: a survey of recent results,” IEEE Transactions on Automatic control, vol. 54, no. 2, pp. 308–322, 2009.
  • [10] W. Zhang, A. Abate, J. Hu, and M. P. Vitus, “Exponential stabilization of discrete-time switched linear systems,” Automatica, vol. 45, no. 11, pp. 2526–2536, 2009.
  • [11] D. Lee and N. He, “A unified switching system perspective and convergence analysis of q-learning algorithms,” in 34th Conference on Neural Information Processing Systems, NeurIPS 2020, 2020.
  • [12] F. Blanchini, P. Colaneri, and M. E. Valcher, “Co-positive lyapunov functions for the stabilization of positive switched systems,” IEEE Transactions on Automatic Control, vol. 57, no. 12, pp. 3038–3050, 2012.
  • [13] E. Fornasini and M. E. Valcher, “Linear copositive lyapunov functions for continuous-time positive switched systems,” IEEE Transactions on Automatic Control, vol. 55, no. 8, pp. 1933–1937, 2010.
  • [14] J. Zhang, Z. Han, and J. Huang, “Stabilization of discrete-time positive switched systems,” Circuits, Systems, and Signal Processing, vol. 32, pp. 1129–1145, 2013.
  • [15] E. Fornasini and M. E. Valcher, “Stability and stabilizability criteria for discrete-time positive switched systems,” IEEE Transactions on Automatic Control, vol. 57, no. 5, pp. 1208–1221, 2011.
  • [16] H. K. Khalil, “Nonlinear systems,” Upper Saddle River, 2002.
  • [17] C.-T. Chen, Linear System Theory and Design.   Oxford University Press, Inc., 1995.
  • [18] D. Lee, J. Hu, and N. He, “A discrete-time switching system analysis of q-learning,” arXiv preprint arXiv:2102.08583, 2021.
  • [19] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   MIT Press, 1998.
  • [20] X. Guo and B. Hu, “Convex programs and lyapunov functions for reinforcement learning: A unified perspective on the analysis of value-based methods,” in 2022 American Control Conference (ACC), 2022, pp. 3317–3322.
  • [21] B. Roszak and E. J. Davison, “Necessary and sufficient conditions for stabilizability of positive lti systems,” Systems & Control Letters, vol. 58, no. 7, pp. 474–481, 2009.

VI Appendix

VI-A Proof of Proposition 4

Proof:

For simplicity, denote A=A_{Q^{*}}. Consider the matrix M such that

M=\sum_{k=0}^{\infty}\left(\frac{1}{\gamma+\varepsilon}\right)^{2k}(A^{k})^{T}A^{k}. \qquad (12)

Noting that

(\gamma+\varepsilon)^{-2}A^{T}MA+I
= \frac{1}{(\gamma+\varepsilon)^{2}}A^{T}\left(\sum_{k=0}^{\infty}\left(\frac{1}{\gamma+\varepsilon}\right)^{2k}(A^{k})^{T}A^{k}\right)A+I
= M,

we have (\gamma+\varepsilon)^{-2}A^{T}MA+I=M, which yields the desired conclusion. Next, it remains to prove the existence of M by establishing its boundedness. Taking the norm of M leads to

\left\|M\right\|_{2}
= \left\|I+(\gamma+\varepsilon)^{-2}A^{T}A+(\gamma+\varepsilon)^{-4}(A^{2})^{T}A^{2}+\cdots\right\|_{2}
\leq \left\|I\right\|_{2}+(\gamma+\varepsilon)^{-2}\left\|A^{T}A\right\|_{2}+(\gamma+\varepsilon)^{-4}\left\|(A^{2})^{T}A^{2}\right\|_{2}+\cdots
= \left\|I\right\|_{2}+(\gamma+\varepsilon)^{-2}\left\|A\right\|_{2}^{2}+(\gamma+\varepsilon)^{-4}\left\|A^{2}\right\|_{2}^{2}+\cdots
\leq 1+|{\cal S}||{\cal A}|(\gamma+\varepsilon)^{-2}\left\|A\right\|_{\infty}^{2}+|{\cal S}||{\cal A}|(\gamma+\varepsilon)^{-4}\left\|A^{2}\right\|_{\infty}^{2}+\cdots
\leq 1-|{\cal S}||{\cal A}|+\frac{|{\cal S}||{\cal A}|}{1-\left(\frac{\gamma}{\gamma+\varepsilon}\right)^{2}}<\infty.

Finally, we prove the bounds on the maximum and minimum eigenvalues. From the definition (12), M\succeq I, and hence \lambda_{\min}(M)\geq 1. On the other hand, one gets

\lambda_{\max}(M)
= \lambda_{\max}\left(I+(\gamma+\varepsilon)^{-2}A^{T}A+(\gamma+\varepsilon)^{-4}(A^{2})^{T}A^{2}+\cdots\right)
\leq \lambda_{\max}(I)+(\gamma+\varepsilon)^{-2}\lambda_{\max}(A^{T}A)+(\gamma+\varepsilon)^{-4}\lambda_{\max}((A^{2})^{T}A^{2})+\cdots
= \lambda_{\max}(I)+(\gamma+\varepsilon)^{-2}\|A\|_{2}^{2}+(\gamma+\varepsilon)^{-4}\|A^{2}\|_{2}^{2}+\cdots
\leq 1+|{\cal S}||{\cal A}|(\gamma+\varepsilon)^{-2}\|A\|_{\infty}^{2}+|{\cal S}||{\cal A}|(\gamma+\varepsilon)^{-4}\|A^{2}\|_{\infty}^{2}+\cdots
\leq \frac{|{\cal S}||{\cal A}|}{1-\left(\frac{\gamma}{\gamma+\varepsilon}\right)^{2}},

where |{\cal S}| and |{\cal A}| denote the cardinalities of the sets {\cal S} and {\cal A}, respectively. The proof is completed. ∎

VI-B Proof of Proposition 6

Proof:

We have

v^{T}A_{Q^{*}}
= w^{T}\left(\sum_{i=0}^{\infty}\left(\frac{1}{\gamma+\varepsilon}\right)^{i}A_{Q^{*}}^{i}\right)A_{Q^{*}}
= (\gamma+\varepsilon)w^{T}\left(\sum_{i=1}^{\infty}\left(\frac{1}{\gamma+\varepsilon}\right)^{i}A_{Q^{*}}^{i}\right)
= (\gamma+\varepsilon)\left\{w^{T}\left(\sum_{i=0}^{\infty}\left(\frac{1}{\gamma+\varepsilon}\right)^{i}A_{Q^{*}}^{i}\right)-w^{T}\right\}
= (\gamma+\varepsilon)(v^{T}-w^{T}).

Moreover,

\left\|v\right\|_{\infty}
= \left\|w^{T}\left(\sum_{i=0}^{\infty}A_{Q^{*}}^{i}\right)\right\|_{\infty}
\leq \left\|w^{T}\right\|_{\infty}\left\|\sum_{i=0}^{\infty}A_{Q^{*}}^{i}\right\|_{\infty}
\leq \left\|w\right\|_{1}\sum_{i=0}^{\infty}\left\|A_{Q^{*}}^{i}\right\|_{\infty}
\leq \left\|w\right\|_{1}\sum_{i=0}^{\infty}\gamma^{i}
= \frac{\|w\|_{1}}{1-\gamma},

and

\left\|v\right\|_{\infty}\geq\left\|w\right\|_{\infty}, which holds because every term of the series defining v is elementwise nonnegative and the i=0 term equals w, so that v\geq w elementwise; this also shows v\geq w>0. This completes the proof. ∎