
Markov Chain Monte Carlo for Koopman-based Optimal Control: Technical Report

João Hespanha1, Kerem Çamsarı2 (Univ. of California, Santa Barbara, USA; 1[email protected], 2[email protected])
This material is based upon work supported by the U.S. Office of Naval Research MURI grant No. N00014-23-1-2708 and by the National Science Foundation grant No. 2229876.
Abstract

We propose a Markov Chain Monte Carlo (MCMC) algorithm based on Gibbs sampling with parallel tempering to solve nonlinear optimal control problems. The algorithm is applicable to nonlinear systems with dynamics that can be approximately represented by a finite-dimensional Koopman model, potentially of high dimension. The algorithm exploits the linearity of the Koopman representation to achieve significant computational savings for large lifted states. We use a video game to illustrate the use of the method.

I INTRODUCTION

This paper addresses the optimal control of nonlinear systems that have reasonably accurate finite-dimensional representations in terms of their Koopman operator [20]. While Koopman's pioneering work is almost a century old, its use as a practical tool to model complex dynamics is much more recent and only became widespread when computational tools became available for the analysis of systems with hundreds to thousands of dimensions. The use of Koopman models for control is even more recent, but has attracted significant attention in the last few years [9, 21, 28, 23, 6, 27].

The linear structure of the Koopman representation permits very efficient solutions for optimal control when the lifted dynamics are linear in the control input, as in [9, 21, 28]. However, linearity in the control severely limits the class of applicable dynamics. Bilinear representations are more widely applicable [23, 6, 27], but are also harder to control.

When the set of admissible control inputs is finite, the Koopman representation can be viewed as a switched linear system, where the optimal control selects, at each time step, one out of several admissible dynamics [7]. The optimal control of switched systems is typically difficult [5, 31], but when the optimization criterion is linear in the lifted state, the dynamic programming cost-to-go is concave and piecewise linear, with a simple representation in terms of a minimum over a finite set of linear functions. This observation enabled the design of efficient algorithms that combine dynamic programming with dynamic pruning [7].

This paper exploits the structure of the Koopman representation to develop an efficient Markov Chain Monte Carlo (MCMC) sampling method for optimal control. Following the pioneering work of [19, 10, 14], we draw samples from a Boltzmann distribution with energy proportional to the criterion to be minimized. The original use of MCMC methods for combinatorial optimization relied on a gradual decrease in temperature, now commonly known as simulated annealing, to prevent the chain from getting trapped in states that do not minimize energy. An alternative approach relies on the use of multiple replicas of a Markov chain, each generating samples at a different temperature. The introduction of multiple replicas of a Markov chain to improve the mixing time can be traced back to [30]. The more recent form of parallel tempering (also known as Metropolis-coupled MCMC, or exchange Monte Carlo) is due to [15, 17].

Our key contribution is an MCMC algorithm that combines Gibbs sampling [14, 13] with parallel tempering to solve the switched linear optimizations that arise from the Koopman representation of nonlinear optimal control problems. This algorithm is computationally efficient due to the combination of two factors: (i) the linear structure of the cost function enables the full variable sweep needed for Gibbs sampling to be performed with computation that scales linearly with the horizon length $T$, and (ii) parallel tempering can be fully parallelized across computation cores, as noted in [11]. With regard to (i), the computational complexity of evaluating Gibbs' conditional distribution for each optimization variable is of order $Tn_\psi^2$, where $n_\psi$ denotes the size of the lifted state. Our algorithm computes the conditional distributions for all $T$ variables in a full Gibbs sweep with computation still just of order $Tn_\psi^2$. With regard to (ii), while here we only explore parallelization across CPU cores, in the last few years hardware parallelization using GPUs and FPGAs has achieved orders of magnitude increases in the number of samples generated with MCMC sampling [25, 3].

The remainder of this paper is organized as follows: Section II shows how a nonlinear control problem can be converted into a switched linear system optimization using the Koopman operator. Section III provides basic background on MCMC, Gibbs sampling, and parallel tempering. Our optimization algorithm is described in Section IV and its use is illustrated in Section V in the context of a video game.

II OPTIMAL CONTROL OF KOOPMAN MODELS

Given a discrete-time nonlinear system of the form

x_{t+1} = f(x_t, u_t), \quad \forall t\in\mathcal{T},\; x_t\in\mathcal{X},\; u_t\in\mathcal{U}_t, \qquad (1)

with the time $t$ taking values over $\mathcal{T} \coloneq \{1,\dots,T\}$, our goal is to solve a final-state optimal control problem of the form

J^* \coloneq \min_{u\in\mathcal{U}} J(u), \qquad J(u) \coloneq g(x_T), \qquad (2)

where $u \coloneq (u_1,\dots,u_T) \in \mathcal{U} \coloneq \mathcal{U}_1\times\cdots\times\mathcal{U}_T$ denotes the control sequence to be optimized; we assume $\mathcal{U}$ finite but potentially with a large number of elements.

For each input $u\in\mathcal{U}_t$, $t\in\mathcal{T}$, the Koopman operator $K_u$ for the system (1) operates on the linear space $\mathcal{F}$ of functions from $\mathcal{X}$ to $\mathbb{R}$ and is defined by

\varphi(\cdot)\in\mathcal{F} \mapsto \varphi(f(\cdot,u))\in\mathcal{F}.

Assuming there is a finite-dimensional linear subspace $\mathcal{F}_{\mathrm{inv}}$ of $\mathcal{F}$ that is invariant for every Koopman operator in the family $\{K_u : u\in\mathcal{U}_t,\; t\in\mathcal{T}\}$, there is an associated family of matrices $\{A(u)\in\mathbb{R}^{n_\psi\times n_\psi} : u\in\mathcal{U}_t,\; t\in\mathcal{T}\}$ such that

\psi_{t+1} = A(u_t)\psi_t, \quad \forall t\in\mathcal{T},\; \psi_t\in\mathbb{R}^{n_\psi},\; u_t\in\mathcal{U}_t, \qquad (3)

where $\psi_t \coloneq \begin{bmatrix}\varphi_1(x_t) & \cdots & \varphi_{n_\psi}(x_t)\end{bmatrix}' \in \mathbb{R}^{n_\psi}$ and the functions $\{\varphi_1(\cdot),\dots,\varphi_{n_\psi}(\cdot)\}$ form a basis for $\mathcal{F}_{\mathrm{inv}}$ [12, 7, 16]. If, in addition, the function $g(\cdot)$ in (2) also belongs to $\mathcal{F}_{\mathrm{inv}}$, there also exists a row vector $c\in\mathbb{R}^{1\times n_\psi}$ such that

g(x_T) = c\,\psi_T, \qquad (4)

which enables re-writing the optimization criterion (2) as

J(u) \coloneq c\,\psi_T = c\,A(u_T)\cdots A(u_1)\,x_1. \qquad (5)

We can thus view the original optimal control problem for the nonlinear system (1) as a switched linear control problem [7]. Note that it is always possible to make sure that (4) holds for some vector $c$ by including $g(x_t)$ as one of the entries of $\psi_t$.

It is generally not possible to find a finite-dimensional subspace $\mathcal{F}_{\mathrm{inv}}$ that contains $g(\cdot)$ and is invariant for every $K_u$, $u\in\mathcal{U}_t$, $t\in\mathcal{T}$. Instead, we typically work with a finite-dimensional space for which (3)–(4) hold up to some error. However, to make this error small, we typically need to work with high-dimensional subspaces, i.e., large values of $n_\psi$. Quantifying the impact on the error of the lifted space dimension $n_\psi$ and of the size of the dataset used to learn the Koopman dynamics is challenging, but notable progress has been reported in [16, 26]. Much work remains, e.g., in quantifying errors for systems with isolated limit sets [4, 22].
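To make the switched-linear viewpoint concrete, the following sketch (in Julia, with hypothetical names) evaluates the cost (5) for a given control sequence, assuming the matrices $A(u)$, the row vector $c$, and the lifted initial state $x_1$ have already been identified.

```julia
using LinearAlgebra

# Sketch (hypothetical names): evaluate the switched-linear cost (5) for a
# control sequence u, where u[t] ∈ {1,…,|U|} indexes the matrices A[1],…,A[|U|]
# and x1 is the lifted initial state.
function koopman_cost(A::Vector{Matrix{Float64}}, c::Vector{Float64},
                      x1::Vector{Float64}, u::Vector{Int})
    ψ = x1
    for t in eachindex(u)
        ψ = A[u[t]] * ψ                      # lifted dynamics (3)
    end
    return dot(c, ψ)                         # J(u) = c ψ_T, as in (5)
end
```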

III MCMC METHODS FOR OPTIMIZATION

To solve a combinatorial minimization of the form

J^* \coloneq \min_{u\in\mathcal{U}} J(u) \qquad (6)

over a finite set $\mathcal{U}$, it is convenient to consider the following Boltzmann distribution with energies $J(u)$, $u\in\mathcal{U}$:

p(u;\beta) \coloneq \frac{e^{-\beta J(u)}}{Q(\beta)}, \quad \forall u\in\mathcal{U}, \qquad Q(\beta) \coloneq \sum_{\bar u\in\mathcal{U}} e^{-\beta J(\bar u)}, \qquad (7)

for some constant $\beta\ge 0$. For consistency with statistical mechanics, we say that values of $\beta$ close to zero correspond to high temperatures, whereas large values correspond to low temperatures. The normalization function $Q(\beta)$ is called the canonical partition function.

For $\beta=0$ (high temperature), the Boltzmann distribution becomes uniform and all $u\in\mathcal{U}$ are equally probable. However, as we increase $\beta$ (lower the temperature), all probability mass concentrates on the subset of $\mathcal{U}$ that minimizes the energy $J(u)$. This can be seen by noting that the ratio of the probabilities of two $u,\bar u\in\mathcal{U}$ is given by

\frac{p(u;\beta)}{p(\bar u;\beta)} = e^{-\beta(J(u)-J(\bar u))} \xrightarrow[\beta\to+\infty]{} \begin{cases} 0 & J(u)>J(\bar u)\\ 1 & J(u)=J(\bar u)\\ \infty & J(u)<J(\bar u), \end{cases}

which shows that, as $\beta\to+\infty$, values with higher energy (and higher cost) are essentially never selected.

This motivates a procedure to solve (6): draw samples from a random variable $\bm{u}_\beta$ with Boltzmann distribution (7) with $\beta$ sufficiently high (temperature sufficiently low) so that all samples correspond to states with the minimum energy/cost.

The following result confirms that the probability $\delta$ that all $N$ samples extracted from the Boltzmann distribution have energies much larger than the minimum $J^*$ can be made arbitrarily small by increasing $N$ in proportion to just the logarithm of $1/\delta$, provided that $\beta$ is sufficiently high (temperature is sufficiently low).

Lemma 1

Let $\{\bm{u}[1],\dots,\bm{u}[N]\}$ be independent samples from the Boltzmann distribution (7). For every $\epsilon\ge 0$, $\delta>0$,

\operatorname{P}\big(J(\bm{u}[i]) \ge J^* + \epsilon, \;\forall i\big) \le \delta,

provided that

\beta \ge \frac{\log(|\mathcal{U}|/|\mathcal{U}^*|)}{\epsilon}, \qquad N \ge \frac{\log(1/\delta)}{\beta\epsilon - \log(|\mathcal{U}|/|\mathcal{U}^*|)},

where $\mathcal{U}^* \coloneq \operatorname*{arg\,min}_{u\in\mathcal{U}} J(u) = \{u\in\mathcal{U} : J(u)=J^*\}$. $\Box$

The number of samples $N$ required to achieve a small probability $\delta$ is a function of the fraction $|\mathcal{U}^*|/|\mathcal{U}|$ of elements of $\mathcal{U}$ that minimize $J(u)$. The result is particularly interesting in that the magnitude of $\beta$ and the number of samples $N$ depend on the logarithm of this fraction, meaning that the size of $\mathcal{U}$ could grow exponentially with respect to some scaling variable (like the time horizon in optimal control) and yet $\beta$ and $N$ would only need to grow linearly. However, the optimism arising from this bound needs to be tempered by the observation that it assumes independent samples, which is typically hard to achieve for large values of $\beta$, as discussed below.
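As an illustration of how mildly these bounds scale, the sketch below evaluates them for a hypothetical problem with $|\mathcal{U}| = 4^{40}$ control sequences and a single minimizer; the particular values of $\epsilon$ and $\delta$ are assumptions made purely for illustration.

```julia
# Sketch: evaluating the bounds of Lemma 1 for a hypothetical problem with
# |U| = 4^40 control sequences and a single minimizer (|U*| = 1).
logratio = 40 * log(4.0)                     # log(|U|/|U*|)
ϵ, δ = 0.1, 1e-6
β = 2 * logratio / ϵ                         # any β ≥ log(|U|/|U*|)/ϵ works
N = log(1 / δ) / (β * ϵ - logratio)          # samples needed for confidence 1-δ
println("β ≥ ", β, ",  N ≥ ", ceil(Int, N))
```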

III-A Markov Chain Monte Carlo Sampling

Consider a discrete-time Markov chain $\{\bm{u}[1],\bm{u}[2],\dots\}$ on a finite set $\mathcal{U}$, with transition probabilities

p(u'|u) = \operatorname{P}\big(\bm{u}[k+1]=u' \,\big|\, \bm{u}[k]=u\big), \quad \forall u',u\in\mathcal{U},\; k\ge 1.

Combining the probabilities of all possible realizations of $\bm{u}[k]$ in a row vector

p[k] \coloneq \big[\operatorname{P}(\bm{u}[k]=u)\big]_{u\in\mathcal{U}}

and organizing the values of the transition probabilities $p(u'|u)$ as a transition matrix

P \coloneq \big[p(u'|u)\big]_{u,u'\in\mathcal{U}},

with one $u'\in\mathcal{U}$ per column and one $u\in\mathcal{U}$ per row, enables us to express the evolution of $p[k]$ as

p[k+K] = p[k]\,P^K, \quad \forall k,K\ge 1.

A Markov chain is called regular if there exists an integer $N$ such that all entries of $P^N$ are strictly positive. In essence, this means that any $\bar u\in\mathcal{U}$ can be reached in $N$ steps from any other $\tilde u\in\mathcal{U}$ through a sequence of transitions with positive probability $p(u'|u)>0$, $u,u'\in\mathcal{U}$. The following result, adapted from [18, Chapter IV], is the key property behind MCMC sampling:

Theorem 1 (Fundamental Theorem of Markov Chains)

For every regular Markov chain on a finite set $\mathcal{U}$ with transition matrix $P$, there exists a vector $\pi$ such that:

1. The Markov chain is geometrically ergodic, meaning that there exist constants $c>0$, $\lambda\in[0,1)$ such that

   \|p[k]P^K - \pi\| \le c\,\lambda^K, \quad \forall k,K\ge 0. \qquad (8)

2. The vector $\pi$ is called the invariant distribution and is the unique solution to the global balance equation:

   \pi P = \pi, \qquad \pi\bm{1} = 1, \qquad \pi \ge 0. \qquad (9)
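The sketch below illustrates Theorem 1 numerically on a hypothetical two-state regular chain, showing the geometric decay in (8) toward the invariant distribution.

```julia
using LinearAlgebra

# Sketch: Theorem 1 on a hypothetical two-state regular chain (all entries of
# P are positive). The invariant distribution solves πP = π, π𝟏 = 1.
P = [0.9 0.1;
     0.2 0.8]
πinv = [2/3 1/3]                             # invariant distribution of P
p = [1.0 0.0]                                # arbitrary initial distribution p[1]
for K in (1, 5, 20, 50)
    println("K = ", K, ":  ", sum(abs, p * P^K - πinv))  # decays like λ^K, cf. (8)
end
```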

To use an MCMC method to draw samples from a desired distribution $p(u;\beta)$, we construct a discrete-time Markov chain that is regular and whose invariant distribution $\pi$ matches $p(u;\beta)$. We then "solve" our sampling problem by only accepting the subsequence of samples $\bm{u}[K+1], \bm{u}[2K+1], \dots$ with $K$ "sufficiently large" so that the distribution $p[K+1] = p[1]P^K$ of the sample $\bm{u}[K+1]$ satisfies

\|p[1]P^K - \pi\| \le \epsilon \quad \xLeftarrow{\text{(8)}} \quad K \ge \frac{\log c + \log \epsilon^{-1}}{\log \lambda^{-1}}, \qquad (10)

for some "sufficiently small" $\epsilon$. Such $K$ guarantees that the samples $\bm{u}[K+1], \bm{u}[2K+1], \dots$ may not quite have the desired distribution, but are away from it by no more than $\epsilon$. Two samples separated by $K$ are not quite independent, but "almost". Independence between $\bm{u}[K+1]$ and $\bm{u}[2K+1]$ would mean that the following two conditional distributions must have the same value

\operatorname{P}(\bm{u}[2K+1] \,|\, \bm{u}[K+1]=u) = p_u P^K, \qquad \operatorname{P}(\bm{u}[2K+1] \,|\, \bm{u}[K+1]=\bar u) = p_{\bar u} P^K,

for every two distributions $p_u$ and $p_{\bar u}$ that place all the probability weight at $u$ and $\bar u\in\mathcal{U}$, respectively. In general these conditional probabilities will not be exactly equal, but we can upper bound their difference by

\|p_u P^K - p_{\bar u}P^K\| \le \|p_u P^K - \pi\| + \|\pi - p_{\bar u}P^K\| \le 2\epsilon.

While we do not quite have independence, this shows that any two such conditional distributions of $\bm{u}[K+1], \bm{u}[2K+1], \dots$ are within $2\epsilon$ of each other.

III-A1 Balance

To make sure that the chain's invariant distribution $\pi$ matches a desired sampling distribution $p(u;\beta)$, $u\in\mathcal{U}$, we need the latter to satisfy the global balance equation (9), which can be re-written in non-matrix form as

p(u';\beta) = \sum_{u\in\mathcal{U}} p(u'|u)\, p(u;\beta), \quad \forall u'\in\mathcal{U}. \qquad (11)

A sufficient (but not necessary) condition for global balance is detailed balance [8], which instead asks for

p(u|u')\,p(u';\beta) = p(u'|u)\,p(u;\beta), \quad \forall u',u\in\mathcal{U}. \qquad (12)
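Detailed balance is easy to check numerically; a minimal sketch, assuming the conventions $P[i,j] = p(j|i)$ and weights $w[i] \propto p(i;\beta)$ (unnormalized is fine, since the partition function $Q(\beta)$ cancels on both sides of (12)):

```julia
# Sketch: check detailed balance (12) for a candidate kernel, with
# P[i,j] = p(j|i) and w[i] ∝ p(i;β) (normalization cancels).
detailed_balance(P, w; tol = 1e-12) =
    all(abs(P[i, j] * w[i] - P[j, i] * w[j]) ≤ tol
        for i in axes(P, 1), j in axes(P, 2))
```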

III-A2 Mixing Times

We saw above that, to obtain (approximately) independent samples with the desired distribution, we can only use one out of every $K$ samples from the Markov chain, with $K$ satisfying (10).

Since $\epsilon^{-1}$ appears in (10) "inside" a logarithm, we can make $\epsilon$ extremely small without having to increase $K$ very much. However, the dependence of $K$ on $\lambda$ can be more problematic because $\lambda\in[0,1)$ can be very close to 1. The multiplicative factor $\frac{1}{\log\lambda^{-1}}$ is typically called the mixing time of the Markov chain and is a quantity that we would like to be small to minimize the number of "wasted samples."

Computing the mixing time is typically difficult, but many bounds for it are available [29]. One of the simplest bounds, due to Hoffman, states that

\lambda \le \frac{p_{\max}-p_{\min}}{p_{\max}+p_{\min}} \quad\Rightarrow\quad \frac{1}{\log\lambda^{-1}} \le \frac{1}{1-\lambda} \le \frac{p_{\max}+p_{\min}}{2 p_{\min}}, \qquad (13)

where

p_{\max} \coloneq \sum_{u'\in\mathcal{U}} \max_{u\in\mathcal{U}} p(u'|u), \qquad p_{\min} \coloneq \sum_{u'\in\mathcal{U}} \min_{u\in\mathcal{U}} p(u'|u).
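The bound (13) is straightforward to evaluate from a transition matrix; a sketch, assuming the convention $P[i,j] = p(u'{=}j \mid u{=}i)$, so that the max/min over $u$ run down each column:

```julia
# Sketch: evaluate the mixing-time bound (13) from a transition matrix P
# with P[i,j] = p(u' = j | u = i).
function mixing_time_bound(P::Matrix{Float64})
    pmax = sum(maximum(P[:, j]) for j in axes(P, 2))   # Σ_{u'} max_u p(u'|u)
    pmin = sum(minimum(P[:, j]) for j in axes(P, 2))   # Σ_{u'} min_u p(u'|u)
    return (pmax + pmin) / (2 * pmin)                  # upper bound on 1/log(λ⁻¹)
end
```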

III-B Gibbs Sampling

Gibbs sampling is an MCMC algorithm that generates a Markov chain with a desired multivariable invariant distribution. This method assumes that, while sampling from the joint distribution is difficult, it is easy to sample from the conditional distributions.

We start with a function $p(u;\beta)$, $u\in\mathcal{U}$, that defines a desired multi-variable joint distribution up to a normalization constant. We use the subscript notation $u_t$ to refer to variable $t\in\mathcal{T}$ of $u$ and $u_{-t}$ to refer to the remaining variables. The algorithm operates as follows:

Algorithm 1 (Gibbs sampling)

1. Pick an arbitrary $\bm{u}[1]\in\mathcal{U}$.

2. Set $k=1$ and repeat until enough samples are collected:

   • Variable sweep: For each $t\in\mathcal{T}$:

     – Sample $\bm{u}_t[k+1]$ with the desired conditional distribution of $\bm{u}_t$, given $\bm{u}_{-t}[k]$:

       \frac{p(u_t, \bm{u}_{-t}[k];\beta)}{\sum_{\bar u_t} p(\bar u_t, \bm{u}_{-t}[k];\beta)}, \qquad (14)

     – Set $\bm{u}_{-t}[k+1] = \bm{u}_{-t}[k]$ and increment $k$. $\Box$
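A minimal sketch of one full variable sweep of Algorithm 1 for the Boltzmann distribution (7) follows. This naive form re-evaluates $J$ for every candidate value of every variable; Section IV removes these repeated evaluations for the Koopman cost (5). The helper `draw` and the per-variable alphabets `U[t]` are hypothetical names.

```julia
# Draw index i with probability w[i]/sum(w) (inverse-CDF sampling).
function draw(w::Vector{Float64})
    r, s = rand() * sum(w), 0.0
    for (i, wi) in enumerate(w)
        s += wi
        r ≤ s && return i
    end
    return length(w)
end

# One full variable sweep of Algorithm 1 for the Boltzmann distribution (7).
function gibbs_sweep!(u::Vector{Int}, J, U::Vector{Vector{Int}}, β::Float64)
    for t in eachindex(u)                    # variable sweep over t ∈ 𝒯
        w = [exp(-β * J([u[1:t-1]; v; u[t+1:end]])) for v in U[t]]
        u[t] = U[t][draw(w)]                 # sample from the conditional (14)
    end
    return u
end
```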

III-B1 Balance

It is straightforward to show that the Gibbs transition probability corresponding to the update of each variable $\bm{u}_t[k]$, $t\in\mathcal{T}$, satisfies the detailed balance equation (12) for the desired distribution $p(u;\beta)$:

p_t(u'|u)\,p(u;\beta) = p_t(u|u')\,p(u';\beta), \quad \forall u,u'\in\mathcal{U},

which means that we also have global balance. Denoting by $P_t$ the corresponding transition matrix, this means that

\pi_\beta P_t = \pi_\beta, \quad \forall t\in\mathcal{T},

where $\pi_\beta$ denotes the vector of probabilities associated with the desired distribution $p(u;\beta)$. A full variable sweep has transition matrix given by

P_{\mathrm{sweep}} \coloneq P_1 P_2 \cdots P_T,

which also satisfies global balance because

\pi_\beta P_{\mathrm{sweep}} = \pi_\beta P_1 P_2\cdots P_T = \pi_\beta P_2\cdots P_T = \cdots = \pi_\beta.

III-B2 Regularity and Convergence

The condition

p(u;\beta) > 0, \quad \forall u\in\mathcal{U}, \qquad (15)

guarantees that, for every two possible values of the Markov chain's state, there is a path of nonzero probability that takes the state from one value to the other over a full variable sweep (essentially by changing one variable at a time). This means that the Markov chain generated by Gibbs sampling is regular over the $T$ steps of a variable sweep. To verify that this is so, imagine that at the start of a sweep ($t=1$) we have some arbitrary $\bm{u}[k]=u\in\mathcal{U}$ and we want to verify that there is nonzero probability of ending up at some other arbitrary $\bm{u}[k+T]=u'\in\mathcal{U}$ at the end of the sweep.

In the first sweep step ($t=1$) only the first variable of $\bm{u}[k]$ changes, but (15) guarantees that there is a positive probability that $\bm{u}[k]$ will transition to some $\bm{u}[k+1]$ for which the first variable matches that of $u'$, i.e., $\bm{u}_1[k+1]=u'_1$, while all other variables remain unchanged. At the second sweep step ($t=2$) only the second variable of $\bm{u}[k+1]$ changes and now (15) guarantees that there is a positive probability that the second variable will also match that of $u'$, i.e., we will get $\bm{u}_1[k+2]=u'_1$ and $\bm{u}_2[k+2]=u'_2$. Continuing this reasoning, we conclude that there is a positive probability that by the last sweep step the whole $\bm{u}[k+T]$ matches $u'$.

Having concluded that Gibbs sampling generates a regular Markov chain across a full variable sweep, Theorem 1 allows us to conclude that the distribution of $\bm{u}[k]$ converges exponentially fast to the desired distribution. However, we shall see shortly that the mixing time can scale poorly with the temperature parameter $\beta$.

III-B3 The binary case

For a cost function $J(u)$ that takes binary values in $\{0,1\}$, where each $u\in\mathcal{U}$ is an $n$-tuple of binary values, one can compute bounds on the values of the transition probabilities $p_{\mathrm{sweep}}(u'|u)$, $u,u'\in\mathcal{U}$, over the $n$ steps of a variable sweep:

p_{\mathrm{sweep}}(u'|u) \ge \begin{cases} \frac{1}{1+e^{-\beta}}\Big(\frac{e^{-\beta}}{1+e^{-\beta}}\Big)^{n-1} & J(u')=0\\ \Big(\frac{e^{-\beta}}{1+e^{-\beta}}\Big)^{n} & J(u')=1, \end{cases}

from which we can conclude that

p_{\min} \coloneq \sum_{u'\in\mathcal{U}} \min_{u\in\mathcal{U}} p_{\mathrm{sweep}}(u'|u) \ge \Big(\frac{e^{-\beta}}{1+e^{-\beta}}\Big)^{n-1} \frac{|\mathcal{U}^*| + e^{-\beta}(2^n - |\mathcal{U}^*|)}{1+e^{-\beta}},

p_{\max} \coloneq \sum_{u'\in\mathcal{U}} \max_{u\in\mathcal{U}} p_{\mathrm{sweep}}(u'|u) \le 2^n,

where $\mathcal{U}^* \coloneq \{u\in\mathcal{U} : J(u)=0\}$. Using this in the bound (13), we get

\frac{1}{\log\lambda^{-1}} \le \frac{(e^{\beta}+1)^n + (e^{\beta}-1)\frac{|\mathcal{U}^*|}{2^n} + 1}{4(1-e^{-\beta})\,|\mathcal{U}^*|},

which shows a bound for the mixing time growing exponentially with $T\beta$. This bound is typically very conservative and, in fact, when $\mathcal{U}^*$ has a single element, it is possible to obtain the following much tighter bounds:

p_{\min} \coloneq \sum_{u'\in\mathcal{U}} \min_{u\in\mathcal{U}} p_{\mathrm{sweep}}(u'|u) \ge \frac{e^{-\beta}(2^n-1)+1}{2^{n-1}(1+e^{-\beta})},

p_{\max} \coloneq \sum_{u'\in\mathcal{U}} \max_{u\in\mathcal{U}} p_{\mathrm{sweep}}(u'|u) \le 1 + \frac{1}{1+e^{-\beta}} \le 2,

that lead to

\frac{1}{\log\lambda^{-1}} \le \frac{1+2e^{-\beta}+2^{-n}(1-e^{-\beta})}{e^{-\beta}+2^{-n}(1-e^{-\beta})} \xrightarrow[n\to\infty]{} e^{\beta}+2,

showing that the mixing time remains bounded as $T\to\infty$, but increases exponentially as $\beta$ increases (temperature decreases).

III-C Parallel tempering

Tempering decreases the mixing time of a Markov chain by creating high-probability "shortcuts" between states. It is applicable whenever we can embed the desired distribution into a family of distributions parameterized by $\beta\in[\beta_{\min},\beta_{\max}]$, with the property that mixing is slow for the desired distribution, which corresponds to $\beta=\beta_{\max}$, but fast for the distribution corresponding to $\beta=\beta_{\min}$; this is typically the case for the Boltzmann distribution (7). The key idea behind tempering is then to select $M$ values

\mathcal{B} \coloneq \{\beta_1 \coloneq \beta_{\min} < \beta_2 < \beta_3 < \cdots < \beta_M \coloneq \beta_{\max}\}

and generate samples from the joint distribution

p(u^\beta : \beta\in\mathcal{B}) \coloneq \prod_{\beta\in\mathcal{B}} p(u^\beta;\beta) \qquad (16)

that corresponds to $M$ independent random variables $\bm{u}^\beta$, one for each parameter value $\beta\in\mathcal{B}$. We group these variables as an $M$-tuple and denote the joint Markov chain by

\bm{u}[k] \coloneq (\bm{u}^\beta[k] : \beta\in\mathcal{B}).

Eventually, from each $M$-tuple we only use the samples $\bm{u}^\beta[k]$ with $\beta=\beta_{\max}$, which correspond to the desired distribution.

III-C1 General algorithm

Tempering can be applied to any MCMC method associated with a regular Markov chain with transition probabilities $p(u'|u;\beta)$ and strictly positive transition matrices $P_\beta$, $\beta\in\mathcal{B}$. The algorithm uses a flip function defined by

f_{\mathrm{flip}}(u^{\beta_j}, u^{\beta_{j+1}}) = \min\Big\{ \frac{p(u^{\beta_{j+1}};\beta_j)}{p(u^{\beta_{j+1}};\beta_{j+1})} \frac{p(u^{\beta_j};\beta_{j+1})}{p(u^{\beta_j};\beta_j)}, 1 \Big\}, \qquad (17)

and operates as follows:

Algorithm 2 (Tempering)

1. Pick an arbitrary $\bm{u}[1] = (\bm{u}^\beta[1]\in\mathcal{U} : \beta\in\mathcal{B})$.

2. Set $k=1$ and repeat until enough samples are collected:

   (a) For each $\beta\in\mathcal{B}$, sample $\bm{u}^\beta[k+1]$ with probability $p(u'\,|\,\bm{u}^\beta[k];\beta)$ and increment $k$.

   (b) Tempering sweep: For each $j\in\{1,\dots,M-1\}$:

       • Compute the flip probability

         p_{\mathrm{flip}} = f_{\mathrm{flip}}(\bm{u}^{\beta_j}[k], \bm{u}^{\beta_{j+1}}[k]) \qquad (18)

         and set

         \bm{u}[k+1] = \begin{cases} \bm{\tilde u}[k] & \text{with prob. } p_{\mathrm{flip}},\\ \bm{u}[k] & \text{with prob. } 1-p_{\mathrm{flip}}, \end{cases}

         where $\bm{\tilde u}[k]$ is a version of $\bm{u}[k]$ with the entries $\bm{u}^{\beta_j}[k]$ and $\bm{u}^{\beta_{j+1}}[k]$ flipped; and increment $k$. $\Box$

Step 2a corresponds to one step of a base MCMC algorithm (e.g., Gibbs sampling) for each value of $\beta\in\mathcal{B}$. For the Boltzmann distribution (7), the flip function is given by

f_{\mathrm{flip}}(u^{\beta_j}, u^{\beta_{j+1}}) = \min\Big\{ e^{-(\beta_j - \beta_{j+1})(J(u^{\beta_{j+1}}) - J(u^{\beta_j}))}, 1 \Big\},

which means that the variables $\bm{u}^{\beta_j}[k]$ and $\bm{u}^{\beta_{j+1}}[k]$ are flipped with probability one whenever $J(\bm{u}^{\beta_j}) < J(\bm{u}^{\beta_{j+1}})$. Intuitively, the tempering sweep in step 2b quickly brings to $\bm{u}^{\beta_{\max}}[k]$ low-energy/low-cost samples that may have been "discovered" by other $\bm{u}^\beta[k]$, $\beta<\beta_{\max}$, with better mixing.
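A sketch of one tempering sweep (step 2b) for Boltzmann energies follows, assuming each replica's cost is stored alongside it (hypothetical names `u`, `Ju`, `βs`, with `βs` sorted increasingly):

```julia
# Sketch of one tempering sweep (step 2b): u[j] is the replica at temperature
# βs[j] and Ju[j] = J(u[j]) its stored cost.
function tempering_sweep!(u::Vector, Ju::Vector{Float64}, βs::Vector{Float64})
    for j in 1:length(βs)-1
        pflip = min(exp(-(βs[j] - βs[j+1]) * (Ju[j+1] - Ju[j])), 1.0)
        if rand() < pflip                    # flip with probability (18)
            u[j], u[j+1] = u[j+1], u[j]      # swap replicas and their costs
            Ju[j], Ju[j+1] = Ju[j+1], Ju[j]
        end
    end
    return u, Ju
end
```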

III-C2 Balance

Since the sample extractions in step 2a are independent, the transition probability corresponding to this step is given by

p\big((u'^\beta : \beta\in\mathcal{B}) \,\big|\, (u^\beta : \beta\in\mathcal{B})\big) = \prod_{\beta\in\mathcal{B}} p(u'^\beta | u^\beta; \beta),

which satisfies the global balance equation for the joint distribution in (16). Picking some $j\in\{1,\dots,M-1\}$, the transition probability for step 2b is given by

p\big((u'^\beta : \beta\in\mathcal{B}) \,\big|\, (u^\beta : \beta\in\mathcal{B})\big) = \begin{cases} f_{\mathrm{flip}}(u^{\beta_j}, u^{\beta_{j+1}}) & u'^{\beta_j} = u^{\beta_{j+1}},\; u^{\beta_j} = u'^{\beta_{j+1}},\; u'^\beta = u^\beta \;\forall \beta\notin\{\beta_j,\beta_{j+1}\},\\ 0 & \text{otherwise}, \end{cases}

and therefore

p\big((u'^\beta : \beta\in\mathcal{B}) \,\big|\, (u^\beta : \beta\in\mathcal{B})\big)\, p(u^\beta : \beta\in\mathcal{B}) = \begin{cases} p(u^{\beta_j};\beta_j)\, p(u^{\beta_{j+1}};\beta_{j+1})\, p(u^\beta : \beta\notin\{\beta_j,\beta_{j+1}\})\, f_{\mathrm{flip}}(u^{\beta_j}, u^{\beta_{j+1}}) & \text{if } u'^{\beta_j} = u^{\beta_{j+1}},\; u^{\beta_j} = u'^{\beta_{j+1}},\; u'^\beta = u^\beta \;\forall \beta\notin\{\beta_j,\beta_{j+1}\},\\ 0 & \text{otherwise}, \end{cases}

whereas

p\big((u^\beta : \beta\in\mathcal{B}) \,\big|\, (u'^\beta : \beta\in\mathcal{B})\big)\, p(u'^\beta : \beta\in\mathcal{B}) = \begin{cases} p(u'^{\beta_j};\beta_j)\, p(u'^{\beta_{j+1}};\beta_{j+1})\, p(u'^\beta : \beta\notin\{\beta_j,\beta_{j+1}\})\, f_{\mathrm{flip}}(u'^{\beta_j}, u'^{\beta_{j+1}}) & \text{if } u^{\beta_j} = u'^{\beta_{j+1}},\; u'^{\beta_j} = u^{\beta_{j+1}},\; u'^\beta = u^\beta \;\forall \beta\notin\{\beta_j,\beta_{j+1}\},\\ 0 & \text{otherwise} \end{cases} = \begin{cases} p(u^{\beta_{j+1}};\beta_j)\, p(u^{\beta_j};\beta_{j+1})\, p(u^\beta : \beta\notin\{\beta_j,\beta_{j+1}\})\, f_{\mathrm{flip}}(u^{\beta_{j+1}}, u^{\beta_j}) & \text{if } u^{\beta_j} = u'^{\beta_{j+1}},\; u'^{\beta_j} = u^{\beta_{j+1}},\; u'^\beta = u^\beta \;\forall \beta\notin\{\beta_j,\beta_{j+1}\},\\ 0 & \text{otherwise}. \end{cases}

To get detailed balance, we thus need

p(u^{\beta_j};\beta_j)\, p(u^{\beta_{j+1}};\beta_{j+1})\, f_{\mathrm{flip}}(u^{\beta_j}, u^{\beta_{j+1}}) = p(u^{\beta_{j+1}};\beta_j)\, p(u^{\beta_j};\beta_{j+1})\, f_{\mathrm{flip}}(u^{\beta_{j+1}}, u^{\beta_j}).

This equality always holds for the flip function defined in (17), for which we always have either $f_{\mathrm{flip}}(u^{\beta_j},u^{\beta_{j+1}})=1$ or $f_{\mathrm{flip}}(u^{\beta_{j+1}},u^{\beta_j})=1$. This guarantees detailed balance for all $M-1$ steps in 2b and therefore global balance for all the combined steps in 2a and 2b over a full tempering sweep.

III-C3 Regularity and Convergence

Assuming that, for each $\beta\in\mathcal{B}$, the Markov chain generated by $p(u'|u;\beta)$ is regular with a strictly positive transition matrix $P_\beta$, any possible combination of states $\bm{u}[k] = (u^\beta[k]:\beta\in\mathcal{B})$ at time $k$ can transition in step 2a to any possible combination of states $\bm{u}[k+1] = (u^\beta[k+1]:\beta\in\mathcal{B})$ at time $k+1$. This means that this step corresponds to a transition matrix $P$ with strictly positive entries. In contrast, each flip step corresponds to a transition matrix $P_{\mathrm{flip}\,j}$ that is non-negative and right-stochastic, but typically has many zero entries. However, no $P_{\mathrm{flip}\,j}$ can have a row that is identically zero (because each row must add up to 1). This suffices to conclude that any product of the form

Q = P_{\mathrm{flip}\,1} P_{\mathrm{flip}\,2} \cdots P_{\mathrm{flip}\,M-1} P

must have all entries strictly positive. This transition matrix corresponds to the transition from the start of step 2b in one tempering sweep to the start of the same step at the next sweep, and defines a regular Markov chain. Theorem 1 thus allows us to conclude that the distribution at the start of step 2b converges to the desired invariant distribution. Since every step satisfies global balance, the invariant distribution is preserved across every step, so the distributions after every step also converge to the invariant distribution.

IV MCMC FOR OPTIMAL CONTROL

Our goal is to optimize a criterion of the form (5). In the context of MCMC sampling from the Boltzmann distribution (7), this corresponds to a Gibbs update for the control $\bm{u}^\beta_t[k+1]$ with a distribution (14) that can be computed using

\frac{e^{-\beta J(u_t, \bm{u}^\beta_{-t}[k])}}{\sum_{\bar u_t} e^{-\beta J(\bar u_t, \bm{u}^\beta_{-t}[k])}} = \frac{e^{-\beta \bm{c}^\beta_t[k] A(u_t) \bm{x}^\beta_t[k]}}{\sum_{\bar u_t} e^{-\beta \bm{c}^\beta_t[k] A(\bar u_t) \bm{x}^\beta_t[k]}},

and a tempering flip probability in (18) computed using

p_{\mathrm{flip}} = \min\Big\{ e^{-(\beta_j-\beta_{j+1})(J(\bm{u}^{\beta_{j+1}}[k]) - J(\bm{u}^{\beta_j}[k]))}, 1 \Big\} = \min\Big\{ e^{-(\beta_j-\beta_{j+1})\, c\,(\bm{x}^{\beta_{j+1}}_T[k] - \bm{x}^{\beta_j}_T[k])}, 1 \Big\},

where

\bm{c}^\beta_t[k] \coloneq c\, A(\bm{u}^\beta_T[k]) \cdots A(\bm{u}^\beta_{t+1}[k]), \qquad (19)

\bm{x}^\beta_t[k] \coloneq A(\bm{u}^\beta_{t-1}[k]) \cdots A(\bm{u}^\beta_1[k])\, x_1. \qquad (20)

The following algorithm implements Gibbs sampling with tempering using recursive formulas to evaluate (19)–(20).

Algorithm 3 (Tempering for Koopman optimal control)

1. Pick an arbitrary $\bm{u}[1] = (\bm{u}^\beta_t[1]\in\mathcal{U} : t\in\mathcal{T},\, \beta\in\mathcal{B})$.

2. Set $k=1$ and repeat until enough samples are collected:

   (a) Gibbs sampling: For each $\beta\in\mathcal{B}$:

       • Set $\bm{x}^\beta_1[k] = x_1$, $\bm{c}^\beta_T[k] = c$ and, $\forall t\in\mathcal{T}$,

         \bm{c}^\beta_t[k] = \bm{c}^\beta_{t+1}[k]\, A(\bm{u}^\beta_{t+1}[k]). \qquad (21)

       • Variable sweep: For each $t\in\mathcal{T}$:

         – Sample $\bm{u}^\beta_t[k+1]$ with distribution

           \frac{e^{-\beta \bm{c}^\beta_t[k] A(u_t) \bm{x}^\beta_t[k]}}{\sum_{\bar u_t} e^{-\beta \bm{c}^\beta_t[k] A(\bar u_t) \bm{x}^\beta_t[k]}}, \qquad (22)

         – Set $\bm{u}^\beta_{-t}[k+1] = \bm{u}^\beta_{-t}[k]$, update

           \bm{x}^\beta_{t+1}[k+1] = A(\bm{u}^\beta_t[k+1])\, \bm{x}^\beta_t[k], \qquad (23)

           and increment $k$.

   (b) Tempering sweep: For each $j\in\{1,\dots,M-1\}$:

       • Compute the flip probability

         p_{\mathrm{flip}} = \min\Big\{ e^{-(\beta_j-\beta_{j+1})\, c\,(\bm{x}^{\beta_{j+1}}_T[k] - \bm{x}^{\beta_j}_T[k])}, 1 \Big\},

         and set

         \bm{u}[k+1] = \begin{cases} \bm{\tilde u}[k] & \text{with prob. } p_{\mathrm{flip}},\\ \bm{u}[k] & \text{with prob. } 1-p_{\mathrm{flip}}, \end{cases}

         where $\bm{\tilde u}[k]$ is a version of $\bm{u}[k]$ with the entries $\bm{u}^{\beta_j}[k]$ and $\bm{u}^{\beta_{j+1}}[k]$ flipped; and increment $k$. $\Box$
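A sketch of the Gibbs sweep of Algorithm 3 for a single temperature follows (hypothetical names; it reuses the helper `draw` from the Gibbs sketch in Section III-B). It makes explicit how the backward recursion (21) and the forward update (23) keep the full sweep at $O(T(2+|\mathcal{U}|)n_\psi^2)$ operations instead of $T|\mathcal{U}|$ full cost evaluations.

```julia
using LinearAlgebra

# Sketch of the Gibbs sweep of Algorithm 3 at one temperature β.
function koopman_gibbs_sweep!(u::Vector{Int}, A::Vector{Matrix{Float64}},
                              c::Vector{Float64}, x1::Vector{Float64}, β::Float64)
    T = length(u)
    cs = Vector{Vector{Float64}}(undef, T)
    cs[T] = c
    for t in T-1:-1:1
        cs[t] = A[u[t+1]]' * cs[t+1]         # c_t = c_{t+1} A(u_{t+1}), as in (21)
    end
    x = x1
    for t in 1:T
        w = [exp(-β * dot(cs[t], A[v] * x)) for v in eachindex(A)]
        u[t] = draw(w)                       # sample u_t from (22)
        x = A[u[t]] * x                      # x_{t+1} = A(u_t) x_t, as in (23)
    end
    return u
end
```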

Computational complexity and parallelization

The bulk of the computation required by Algorithm 3 lies in the matrix-vector products that appear in (21), (22), and (23), each of these products requiring $O(n_\psi^2)$ floating-point operations, for a total of $O(MT(2+|\mathcal{U}|)n_\psi^2)$ operations. The tempering step does not use any additional matrix-vector multiplications. In contrast, a naive implementation of Gibbs sampling with tempering would require $MT|\mathcal{U}|$ evaluations of the cost function (5), each with computational complexity $O(Tn_\psi^2)$. The reduction in computational complexity by a factor of $T|\mathcal{U}|/(2+|\mathcal{U}|)$ is especially significant when the time horizon is large. The price paid for the computational savings is that we need to store the vectors (21), (23), with memory complexity $O(MTn_\psi)$.

The matrix-vector products mentioned above are independent across different values of $\beta\in\mathcal{B}$ and can be performed in parallel. This means that tempering across $M$ temperatures can be computationally very cheap if enough computational cores and memory are available. In contrast, a Gibbs variable sweep is not so easily parallelized because, within one sweep, the value of each sample $\bm{u}^\beta_t[k+1]$ typically depends on the values of previous samples (for the same value of $\beta$). Tempering thus both decreases the mixing time of the Markov chain and opens the door to a high degree of parallelization.
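A sketch of this parallelization using Julia threads (hypothetical names: `us[m]` holds the control sequence of the replica at temperature `βs[m]`; the sweeps are independent across `m`):

```julia
# Sketch: parallelize the Gibbs step across the M temperatures with threads,
# reusing koopman_gibbs_sweep! from the sketch above.
function parallel_gibbs_step!(us::Vector{Vector{Int}}, A::Vector{Matrix{Float64}},
                              c::Vector{Float64}, x1::Vector{Float64},
                              βs::Vector{Float64})
    Threads.@threads for m in eachindex(βs)
        koopman_gibbs_sweep!(us[m], A, c, x1, βs[m])
    end
    return us
end
```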

V Numerical Example

We tested the algorithm proposed in this paper on a Koopman model for the Atari 2600 Assault video game. The goal of the game is to protect Earth from small attack vessels deployed by an alien mothership. The mothership and attack vessels shoot at the player's ship, and the player uses a joystick to dodge the incoming fire and fire back at the alien ships. The player's ship is destroyed either if it is hit by enemy fire or if it "overheats" by shooting at the aliens. The player earns points by destroying enemy ships.

We used a Koopman model with $|\mathcal{U}|=4$ control actions, corresponding to "move left", "move right", "shoot up", and "do nothing". The optimization minimizes the cost

J(u) \coloneq \begin{cases} -\frac{1}{T}\sum_{t=1}^T r_t & \text{if player's ship not destroyed}\\ +10 - \frac{1}{T}\sum_{t=1}^T r_t & \text{if player's ship destroyed} \end{cases}

where $r_t$ denotes the points earned by the player at time $t$. This cost balances the tradeoff between taking some risk to collect points, which decreases $-\frac{1}{T}\sum_{t=1}^T r_t$, and not getting destroyed and incurring the $+10$ penalty. For game play, this optimization is solved at every time step with a receding horizon starting at the current time $t$ and ending at time $t+T$, with only the first control action executed. However, because the focus of this paper is on the solution to (1)–(2), we present results for a single optimization starting from a typical initial condition. A time horizon $T=40$ was used in this section, which corresponds to a total number of control sequences $|\mathcal{U}|^T \approx 1.2\times 10^{24}$.

The state of the system is built directly from screen pixel information. Specifically, the pixels are segmented into 5 categories corresponding to the player’s own ship, the player’s horizontal fire, the player’s vertical fire, the attacker’s ships and fire, and the temperature bar. For the player’s own ship and the attacker’s ships/fire, we consider pixel information from the current and last screenshot, so that we have “velocity” information. The pixels of each category are used to construct “spatial densities” using the entity-based approach described in [7], with observables of the form

\varphi_{\ell j}(x) \coloneq \sum_i e^{-\lambda \|p_{\ell i} - c_{\ell j}\|^2}, \qquad (24)

where the index $\ell$ ranges over the 5 categories above, the summation is taken over the pixels $p_{\ell i}$ associated with category $\ell$, and the $c_{\ell j}$ are fixed points on the screen. The densities associated with the 5 categories are represented by 50, 50, 100, 200, and 16 points $c_{\ell j}$, respectively. The observables in (24), together with the value of the optimization criterion, are used to form a lifted state $\psi_t$ with dimension $n_\psi = 667$ (see Table I). The matrices $A(u)$ in (5) were estimated from 500 game traces with random inputs. Collecting data from the Atari simulator and lifting the state took about 1h45min, whereas solving the least-squares problems needed to obtain the Koopman matrices took less than 1 sec.
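A sketch of how the observables (24) can be evaluated for one category, assuming pixel and center coordinates are stored one 2-D point per column (a hypothetical data layout):

```julia
# Sketch: evaluate the entity-based observables (24) for one pixel category ℓ;
# entry j of the result is Σ_i exp(-λ‖p_{ℓi} - c_{ℓj}‖²).
function spatial_density(pixels::Matrix{Float64}, centers::Matrix{Float64}, λ::Float64)
    ψ = zeros(size(centers, 2))
    for j in axes(centers, 2), i in axes(pixels, 2)
        ψ[j] += exp(-λ * sum(abs2, pixels[:, i] - centers[:, j]))  # e^{-λ‖p-c‖²}
    end
    return ψ
end
```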

TABLE I: State of the Koopman model

Description                                Dimension
player's ship, current pixels                     50
player's ship, last pixels                        50
player's horizontal fire, current pixels          50
player's vertical fire, current pixels           100
attacker's ships, current pixels                 200
attacker's ships, last pixels                    200
temperature bar, current pixels                   16
optimization criterion                             1
Koopman state dimension $n_\psi$                 667
Figure 1: Typical run of Algorithm 3: the first 5 plots show the cost ($y$-axis) at the end of each iteration ($x$-axis), as well as the minimum cost found so far, for a specific temperature $\beta$. For this run, 12 logarithmically scaled temperatures were used, but only 5 are shown here. In each plot, red and blue dots mark iterations where flips occurred during the tempering sweep. The bottom-right plot shows the total number of flips across "adjacent" temperatures ($x$-axis), as a percentage of the total number of iterations ($y$-axis).

Figure 1 depicts a typical run of Algorithm 3, showing flipping of samples across adjacent temperatures in 40-80% of the tempering sweeps. In the remainder of this section we compare the performance of Algorithm 3 with several alternatives. All run times refer to Julia implementations on a 2018 MacBook Pro with a 2.6GHz Intel Core i7 CPU.

Figure 2 compares Algorithm 3 with the algorithm in [7], which exploits the piecewise-linear structure of the cost-to-go to efficiently represent and evaluate the value function and also to dynamically prune the search tree. Due to the need for exploration, this algorithm "protects" from pruning a random fraction of tree branches (see [7] for details). Both algorithms used 6 CPU cores. For Algorithm 3, each core executed one sweep of the tempering algorithm and, for the algorithm in [7], the 6 cores were used by BLAS to speed up matrix multiplication. Both algorithms were executed multiple times and the plots show the costs obtained as a function of run time. For Algorithm 3, the run time is controlled by the number of samples drawn. For the algorithm in [7], the run time is controlled by the number of vectors used to represent the value function. Both algorithms eventually discover comparable "optimal" solutions, but Algorithm 3 consistently finds a lower cost faster.

Figure 2: Run-time comparison between Algorithm 3 (purple) and the dynamic programming algorithm in [7] (green). The $x$-axis denotes run time and the $y$-axis the cost achieved, with the solid lines showing the average cost across 15 different runs and the shaded areas the whole range of costs obtained over those runs. The different plots correspond to different values for the number of temperatures $M$.
Figure 3: Run-time comparison between Algorithm 3 with 18 temperatures (purple) and a gradient descent algorithm (top, orange) and a genetic algorithm (bottom, blue). The meaning of the solid lines and shaded areas is the same as in Figure 2.

The top plot in Figure 3 compares Algorithm 3 with a gradient descent solver that minimizes the following continuous relaxation of the cost (5):

J_{\mathrm{relax}}(\mu) \coloneq c\,\psi_T, \qquad \psi_{t+1} = \Big(\sum_{u\in\mathcal{U}} \mu_u(t) A(u)\Big)\psi_t, \qquad (25)

where the optimization variables $\mu_u(t)\in[0,1]$ are required to satisfy $\sum_{u\in\mathcal{U}} \mu_u(t) = 1$, $\forall t\in\mathcal{T}$. The global minimum $J^*_{\mathrm{relax}}$ of (25) would match that of (5) if we forced the $\mu_u(t)$ to take binary values in $\{0,1\}$; otherwise, $J^*_{\mathrm{relax}}$ provides a lower bound for (6). We minimized (25) using the toolbox [2]. The results shown were obtained using Nesterov-accelerated adaptive moment estimation [2] with $\eta=0.5$, which resulted in the best performance among the algorithms supported by [2]. The constraints on $\mu_u(t)$ were enforced by minimum-distance projection onto the constraint set. In general, gradient descent converges quickly, but to a local minimum of (25) with cost higher than the minimum found by Algorithm 3 for (6), in spite of the fact that the global minimum of (25) is potentially smaller than that of (6).
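The paper does not detail how the projection was implemented; one standard choice, sketched below under that assumption, is the sort-based Euclidean projection onto the probability simplex:

```julia
# Sketch: Euclidean (minimum-distance) projection onto the probability simplex
# {μ ≥ 0, Σμ = 1} via the standard sort-based algorithm (an assumption; the
# paper does not specify the implementation used).
function project_simplex(v::Vector{Float64})
    s = sort(v; rev = true)
    cs = cumsum(s)
    ρ = findlast(j -> s[j] + (1 - cs[j]) / j > 0, eachindex(s))
    θ = (1 - cs[ρ]) / ρ
    return max.(v .+ θ, 0.0)                 # componentwise clamp at zero
end
```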

The bottom plot in Figure 3 compares Algorithm 3 with a genetic optimization algorithm that also resorts to stochastic exploration. This type of algorithm evolves a "population" of candidate solutions to the optimization (6) by mutation, crossover, and selection. We minimized (6) using the toolbox [1]. The results in Figure 3 used a population size $P=50$; selection based on uniform ranking (which selects the best $\mu=2$ individuals with probability $1/\mu$); binary crossover (which combines the solutions of the two parents with a single crossover point); a mutation rate of 10%; and a probability of mutation for each "gene" of 5%. These parameters resulted in the lowest costs we could achieve among the options provided by [1], but still significantly higher than the costs obtained with Algorithm 3.

References

[1] https://github.com/wildart/Evolutionary.jl.
[2] https://github.com/jacobcvt12/GradDescent.jl.
[3] N. Aadit, A. Grimaldi, M. Carpentieri, L. Theogarajan, J. Martinis, G. Finocchio, and K. Y. Camsari. Massively parallel probabilistic computing with sparse Ising machines. Nature Electronics, 5(7):460–468, 2022.
[4] C. Bakker, K. E. Nowak, and W. S. Rosenthal. Learning Koopman operators for systems with isolated critical points. In Proc. of the 58th IEEE Conf. on Decision and Contr., pages 7733–7739, 2019.
[5] S. C. Bengea and R. A. DeCarlo. Optimal control of switching systems. Automatica, 41(1):11–27, 2005.
[6] P. Bevanda, S. Sosnowski, and S. Hirche. Koopman operator dynamical models: Learning, analysis and control. Annual Reviews in Control, 52:197–212, 2021.
[7] M. Blischke and J. P. Hespanha. Learning switched Koopman models for control of entity-based systems. In Proc. of the 62nd IEEE Conf. on Decision and Contr., Dec. 2023.
[8] S. Brooks. Markov chain Monte Carlo method and its application. J. of the Royal Statistical Society: Series D (The Statistician), 47(1):69–100, 1998.
[9] S. Brunton, B. Brunton, J. Proctor, and J. N. Kutz. Koopman invariant subspaces and finite linear representations of nonlinear dynamical systems for control. PLoS ONE, 11(2), 2016.
[10] V. Černý. Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm. J. Opt. Theory and Applications, 45:41–51, 1985.
[11] D. J. Earl and M. W. Deem. Parallel tempering: Theory, applications, and new perspectives. Physical Chemistry Chemical Physics, 7(23):3910–3916, 2005.
[12] C. Folkestad and J. W. Burdick. Koopman NMPC: Koopman-based learning and nonlinear model predictive control of control-affine systems. In Proc. of the IEEE Int. Conf. on Robot. and Automat. (ICRA), pages 7350–7356, 2021.
[13] A. E. Gelfand and A. F. Smith. Sampling-based approaches to calculating marginal densities. J. of the American Statistical Assoc., 85(410):398–409, 1990.
[14] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. on Pattern Anal. and Machine Intell., PAMI-6(6):721–741, 1984.
[15] C. J. Geyer. Markov chain Monte Carlo maximum likelihood. In Computing Science and Statistics: Proc. of the 23rd Symp. on the Interface, pages 156–163, 1991.
[16] M. Haseli and J. Cortés. Modeling nonlinear control systems via Koopman control family: Universal forms and subspace invariance proximity, 2023. arXiv:2307.15368.
[17] K. Hukushima and K. Nemoto. Exchange Monte Carlo method and application to spin glass simulations. J. of the Physical Society of Japan, 65(6):1604–1608, 1996.
[18] J. G. Kemeny and J. L. Snell. Finite Markov Chains. Springer-Verlag, 1976.
[19] S. Kirkpatrick, C. D. Gelatt Jr, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.
[20] B. O. Koopman. Hamiltonian systems and transformation in Hilbert space. Proc. of the National Academy of Sciences U.S.A., 17(5):315–318, 1931.
[21] M. Korda and I. Mezić. Linear predictors for nonlinear dynamical systems: Koopman operator meets model predictive control. Automatica, 93:149–160, 2018.
[22] Z. Liu, N. Ozay, and E. D. Sontag. Properties of immersions for systems with multiple limit sets with implications to learning Koopman embeddings, 2023. arXiv:2312.17045.
[23] A. Mauroy, Y. Susuki, and I. Mezić. Koopman Operator in Systems and Control. Springer, 2020.
[24] I. Mezić. Spectral properties of dynamical systems, model reduction and decompositions. Nonlinear Dynamics, 41:309–325, 2005.
[25] N. Mohseni, P. L. McMahon, and T. Byrnes. Ising machines as hardware solvers of combinatorial optimization problems. Nature Reviews Physics, 4(6):363–379, 2022.
[26] F. Nüske, S. Peitz, F. Philipp, M. Schaller, and K. Worthmann. Finite-data error bounds for Koopman-based prediction and control. Journal of Nonlinear Science, 33(1):14, 2023.
[27] S. E. Otto and C. W. Rowley. Koopman operators for estimation and control of dynamical systems. Annual Review of Control, Robotics, and Autonomous Systems, 4:59–87, 2021.
[28] J. Proctor, S. Brunton, and J. N. Kutz. Generalizing Koopman theory to allow for inputs and control. SIAM J. on Applied Dynamical Syst., 17(1):909–930, 2018.
[29] U. G. Rothblum and C. P. Tan. Upper bounds on the maximum modulus of subdominant eigenvalues of nonnegative matrices. Linear Algebra and its Applications, 66:45–86, 1985.
[30] R. H. Swendsen and J.-S. Wang. Replica Monte Carlo simulation of spin-glasses. Physical Review Lett., 57(21):2607, 1986.
[31] X. Xu and P. Antsaklis. Optimal control of switched systems based on parameterization of the switching instants. IEEE Trans. on Automat. Contr., 49(1):2–16, 2004.