
Path Structured Multimarginal Schrödinger Bridge for Probabilistic
Learning of Hardware Resource Usage by Control Software

Georgiy A. Bondar, Robert Gifford, Linh Thi Xuan Phan, Abhishek Halder
Georgiy A. Bondar is with the Department of Applied Mathematics, University of California, Santa Cruz, CA 95064, USA, [email protected].
Robert Gifford and Linh Thi Xuan Phan are with the Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA, {rgif,linhphan}@seas.upenn.edu.
Abhishek Halder is with the Department of Aerospace Engineering, Iowa State University, Ames, IA 50011, USA, [email protected].
This work was supported by NSF grants 2112755, 2111688 and 1750158.
Abstract

The solution of the path structured multimarginal Schrödinger bridge problem (MSBP) is the most likely measure-valued trajectory consistent with a sequence of observed probability measures or distributional snapshots. We leverage recent algorithmic advances in solving such structured MSBPs for learning stochastic hardware resource usage by control software. The solution enables predicting the time-varying distribution of hardware resource availability at a desired time with guaranteed linear convergence. We demonstrate the efficacy of our probabilistic learning approach in a model predictive control software execution case study. The method exhibits rapid convergence to an accurate prediction of the hardware resource utilization of the controller. The method can be broadly applied to any software to predict cyber-physical context-dependent performance at an arbitrary time.

I Introduction

Control software in safety-critical cyber-physical systems (CPS) is often designed and verified based on platform models that do not fully capture the complexity of its deployment settings. For example, it is common to assume that the processor always operates at full speed, is dedicated to the control software, and that overheads are negligible. In practice, hardware resources – such as the last-level shared cache (LLC), memory bandwidth, and processor cycles – often vary with time and depend on the current hardware state, which is one reason why we observe different execution times across different runs of the same control software [1]. This gap can lead to overly costly or unsafe designs.

Measurement-based approaches and overhead-aware analysis can reduce the analysis pessimism or ensure safety [2]. The recent work [3] uses fine-grained profiles of the software execution on an actual platform to make dynamic scheduling and resource allocation decisions. Supervisory algorithms also exist that dynamically switch among a bank of controllers – all provably safe, but some computationally more benign (and less performant) than others – depending on the resource availability [4]. However, the effectiveness of these techniques is contingent on the quality of the prediction of hardware resource availability at a future time instance or over a time horizon of interest.

In this work, we propose to predict the resource usage of control software based on just a small set of measurements. This approach is attractive as it reduces the measurement effort while better handling potential variability.

A first-principles predictive model for hardware resource availability based on the semiconductor physics of the specific platform is, however, unavailable. Furthermore, resources such as cache and bandwidth are not only time-varying and stochastic, but also statistically correlated. This makes it challenging to predict the joint stochastic variability of hardware resource availability in general. The challenge is even more pronounced for control software because the computational burden then also depends on additional context, e.g., the reference trajectory that the controller is tracking.

We note that for safety-critical CPS, predicting the joint stochastic hardware resource state, as opposed to predicting a lumped variable such as worst-case execution time, can open the door for designing a new class of dynamic scheduling algorithms with better performance than what is feasible today while minimizing hardware cost.

This work proposes learning a joint stochastic process for hardware resource availability from control software execution profiles conditioned on CPS contexts (to be made precise in Sec. III-A, III-B). Our proposed method leverages recent advances in stochastic control – specifically in the multimarginal Schrödinger bridge problem (MSBP) – to allow prediction of the time-varying joint statistical distributions of hardware resource availability at any desired time.

Contributions

Our specific contributions are as follows.

  • We show how recent algorithmic developments in solving the MSBP enable probabilistic learning of hardware resource usage. This advances the state-of-the-art at the intersection of control, learning, and real-time systems.

  • The proposed method is statistically nonparametric, and is suitable for high-dimensional joint prediction since it avoids gridding the hardware feature/state space.

  • The proposed formulation provably predicts the most likely distribution given a sequence of distributional snapshots for the hardware resource state.

  • We explain that the resulting algorithm is an instance of the multimarginal Sinkhorn iteration with a path structured cost that is guaranteed to converge to a unique solution, and enjoys a linear rate of convergence. Its computational complexity scales linearly w.r.t. the state dimension, linearly w.r.t. the number of distributional snapshots, and quadratically w.r.t. the number of scattered samples.

Organization

We introduce the notations and preliminaries in Sec. II. The problem formulation is detailed in Sec. III. Sec. IV summarizes the overall methodology, which is then followed by a numerical case study in Sec. V. Concluding remarks are provided in Sec. VI.

II Notations and Preliminaries

We use unboldfaced capital letters to denote matrices and bold capital letters to denote tensors (of order three or more). Unboldfaced (resp. boldfaced) small letters are used to denote scalars (resp. vectors). Capital calligraphic letters are reserved to denote sets.

Square braces are used to denote components. For instance, $\left[\bm{X}_{i_{1},\ldots,i_{r}}\right]$ denotes the $(i_{1},\ldots,i_{r})$th component of the order $r$ tensor $\bm{X}$, where $(i_{1},\ldots,i_{r})\in\mathbb{N}^{r}$. We use the $r$-fold tensor product space notation $\left(\mathbb{R}^{d}\right)^{\otimes r}:=\underbrace{\mathbb{R}^{d}\otimes\ldots\otimes\mathbb{R}^{d}}_{r\;\text{times}}$.

For two tensors $\bm{X},\bm{Y}$ of order $r$, we define their Hilbert-Schmidt inner product as
\[
\langle\bm{X},\bm{Y}\rangle:=\sum_{i_{1},\ldots,i_{r}}\left[\bm{X}_{i_{1},\ldots,i_{r}}\right]\left[\bm{Y}_{i_{1},\ldots,i_{r}}\right]. \tag{1}
\]

The operators $\exp(\cdot)$ and $\log(\cdot)$ are understood elementwise. We use $\odot$ and $\oslash$ to denote elementwise (Hadamard) multiplication and division, respectively.

For measures $\mu,\nu$ defined on two Polish spaces, their product measure is denoted by $\mu\otimes\nu$. The relative entropy, a.k.a. Kullback-Leibler divergence, $D_{\rm{KL}}(\cdot\|\cdot)$ between probability measures $\mu$ and $\nu$ is
\[
D_{\rm{KL}}(\mu\|\nu):=\begin{cases}\int\log\frac{{\rm{d}}\mu}{{\rm{d}}\nu}\,{\rm{d}}\mu&\text{if}\quad\mu\ll\nu,\\ +\infty&\text{otherwise,}\end{cases} \tag{2}
\]
where $\frac{{\rm{d}}\mu}{{\rm{d}}\nu}$ denotes the Radon-Nikodym derivative, and $\mu\ll\nu$ is shorthand for “$\mu$ is absolutely continuous w.r.t. $\nu$”.

The Hilbert (projective) metric (see e.g., [5]) $d_{\rm{H}}\left(\bm{u},\bm{v}\right)$ between two vectors $\bm{u},\bm{v}\in\mathbb{R}^{n}_{>0}$ is
\[
d_{\rm{H}}\left(\bm{u},\bm{v}\right)=\log\left(\dfrac{\max_{i=1,\ldots,n}u_{i}/v_{i}}{\min_{i=1,\ldots,n}u_{i}/v_{i}}\right). \tag{3}
\]
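For implementation purposes, (3) amounts to a one-liner. Below is a minimal NumPy sketch (the function name is illustrative, not from the authors' codebase); this metric is used in Sec. V to monitor Sinkhorn convergence.

```python
import numpy as np

def hilbert_metric(u, v):
    """Hilbert projective metric d_H(u, v) in (3) for entrywise-positive vectors."""
    r = u / v                              # elementwise ratios u_i / v_i
    return np.log(r.max() / r.min())

# d_H is a metric on rays: it is invariant to positive scaling of either argument.
u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 2.0, 2.0])
assert np.isclose(hilbert_metric(u, v), hilbert_metric(5.0 * u, v))
```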

We use the term “control cycle” to mean one pass of a feedback control loop. Due to hardware stochasticity, each control cycle takes a variable amount of time to complete.

III Problem Formulation

III-A Context $\bm{c}$

We consider a context vector $\bm{c}$ comprised of separable cyber and physical context vectors
\[
\bm{c}:=\begin{pmatrix}\bm{c}_{\rm{cyber}}\\ \bm{c}_{\rm{phys}}\end{pmatrix}. \tag{4}
\]

In this work, we consider an instance of (4) where
\[
\bm{c}_{\rm{cyber}}=\begin{pmatrix}\text{allocated last-level cache}\\ \text{allocated memory bandwidth}\end{pmatrix}, \tag{5}
\]
where both features are allocated in blocks of some fixed size, and
\[
\bm{c}_{\rm{phys}}=y_{\rm{des}}(x)\in{\rm{GP}}\left([x_{\min},x_{\max}]\right), \tag{6}
\]
where ${\rm{GP}}$ denotes a Gaussian process over the domain $[x_{\min},x_{\max}]$. We work with a collection of contexts of cardinality $n_{\rm{context}}$, i.e., a sample of contexts $\{\bm{c}^{i}\}_{i=1}^{n_{\rm{context}}}$.

III-B Hardware Resource State $\bm{\xi}$

For concreteness, we define the hardware resource state or feature vector used in our numerical case study (Sec. V):
\[
\bm{\xi}:=\begin{pmatrix}\xi_{1}\\ \xi_{2}\\ \xi_{3}\end{pmatrix}=\begin{pmatrix}\text{instructions retired}\\ \text{LLC requests}\\ \text{LLC misses}\end{pmatrix}. \tag{7}
\]
The three elements of $\bm{\xi}$ denote the number of retired CPU instructions, the number of LLC requests, and the number of LLC misses in the last time unit (10 ms in our profiling), respectively.

We emphasize that our proposed method is not limited by the specific components comprising $\bm{\xi}$. To highlight this flexibility, we describe the proposed approach for $\bm{\xi}\in\mathbb{R}^{d}$ with suitable interpretations for the specific application.

For a time interval $[0,t]$ of interest, we think of the time-varying $\bm{\xi}$ as a continuous-time vector-valued stochastic process over subsets of $\mathbb{R}^{d}$. Suppose that $s\in\mathbb{N}$, $s\geq 2$, snapshots or observations are made of the stochastic state $\bm{\xi}(\tau)$, $0\leq\tau\leq t$, at (possibly non-equispaced) instances
\[
\tau_{1}\equiv 0<\tau_{2}<\ldots<\tau_{s-1}<\tau_{s}\equiv t.
\]
Consider the snapshot index set $\llbracket s\rrbracket:=\{1,\ldots,s\}$. For a fixed context $\bm{c}$, the snapshot observations comprise a sequence of joint probability measures $\{\mu_{\sigma}\}_{\sigma\in\llbracket s\rrbracket}$ satisfying $\int{\rm{d}}\mu_{\sigma}(\bm{\xi}(\tau_{\sigma}))=1$. In other words,
\[
\bm{\xi}(\tau_{\sigma})\sim\mu_{\sigma}\quad\forall\sigma\in\llbracket s\rrbracket. \tag{8}
\]

In our application, the data $\{\mu_{\sigma}\}_{\sigma\in\llbracket s\rrbracket}$ come from control software execution profiles, i.e., from executing the same control software for the same $\bm{c}$ with all parameters and initial conditions fixed. So the stochasticity in $\bm{\xi}(\tau_{\sigma})$ stems from the dynamic variability in hardware resource availability.

In particular, for finitely many (say $n$) execution profiles, we consider the empirical distributions
\[
\mu_{\sigma}:=\dfrac{1}{n}\sum_{i=1}^{n}\delta(\bm{\xi}-\bm{\xi}^{i}(\tau_{\sigma})), \tag{9}
\]
where $\delta(\bm{\xi}-\bm{\xi}^{i}(\tau_{\sigma}))$ denotes the Dirac delta at the sample location $\bm{\xi}^{i}(\tau_{\sigma})$, $i\in\llbracket n\rrbracket$, $\sigma\in\llbracket s\rrbracket$. At any snapshot index $\sigma\in\llbracket s\rrbracket$, the set $\{\bm{\xi}^{i}(\tau_{\sigma})\}_{i=1}^{n}$ is scattered data, i.e., the samples do not lie on a fixed grid.
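Numerically, the scattered snapshot data and the empirical marginals (9) are straightforward to represent: each marginal is a uniform weight vector over its $n$ sample locations. A minimal sketch (with placeholder random arrays standing in for measured profiles, and dimensions borrowed from Sec. V; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, d = 500, 26, 3                          # profiles, snapshots, features (cf. Sec. V)
xi = [rng.random((n, d)) for _ in range(s)]   # xi[sigma][i] stands in for xi^i(tau_sigma)
mu = [np.full(n, 1.0 / n) for _ in range(s)]  # the empirical marginal (9): mass 1/n per sample
```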

Given the data (8)-(9), we would like to predict the most likely hardware resource state statistics
\[
\bm{\xi}(\tau)\sim\mu_{\tau}\quad\text{for any}\;\tau\in[0,t]. \tag{10}
\]
Without the qualifier “most likely”, the problem is underdetermined since there are uncountably many measure-valued continuous curves over $[0,t]$ that are consistent with the observed data (8)-(9).

Figure 1: The path tree for sequentially observed $\{\mu_{\sigma}\}_{\sigma\in\llbracket s\rrbracket}$.

III-C Multimarginal Schrödinger Bridge

Let $\mathcal{X}_{\sigma}:={\rm{support}}\left(\mu_{\sigma}\right)\subseteq\mathbb{R}^{d}$ $\forall\sigma\in\llbracket s\rrbracket$, and consider the Cartesian product $\mathcal{X}_{1}\times\mathcal{X}_{2}\times\ldots\times\mathcal{X}_{s}=:\bm{\mathcal{X}}\subseteq\left(\mathbb{R}^{d}\right)^{\otimes s}$. Let $\mathcal{M}\left(\mathcal{X}_{\sigma}\right)$ and $\mathcal{M}\left(\bm{\mathcal{X}}\right)$ denote the collection (i.e., manifold) of probability measures on $\mathcal{X}_{\sigma}$ and $\bm{\mathcal{X}}$, respectively. Define a ground cost $\bm{C}:\bm{\mathcal{X}}\mapsto\mathbb{R}_{\geq 0}$.

Following [6, Sec. 3], let
\begin{align}
{\rm{d}}\bm{\xi}_{-\sigma}&:={\rm{d}}\bm{\xi}(\tau_{1})\times\ldots\times{\rm{d}}\bm{\xi}(\tau_{\sigma-1})\times{\rm{d}}\bm{\xi}(\tau_{\sigma+1})\times\ldots\times{\rm{d}}\bm{\xi}(\tau_{s}), \tag{11a}\\
\bm{\mathcal{X}}_{-\sigma}&:=\mathcal{X}_{1}\times\ldots\times\mathcal{X}_{\sigma-1}\times\mathcal{X}_{\sigma+1}\times\ldots\times\mathcal{X}_{s}. \tag{11b}
\end{align}

For $\varepsilon\geq 0$ (not necessarily small), the multimarginal Schrödinger bridge problem (MSBP) is the following infinite dimensional convex program:
\begin{align}
\underset{\bm{M}\in\mathcal{M}\left(\bm{\mathcal{X}}\right)}{\min}&\int_{\bm{\mathcal{X}}}\big\{\bm{C}\left(\bm{\xi}(\tau_{1}),\ldots,\bm{\xi}(\tau_{s})\right)+\varepsilon\log\bm{M}\left(\bm{\xi}(\tau_{1}),\ldots,\bm{\xi}(\tau_{s})\right)\big\}\,\bm{M}\left(\bm{\xi}(\tau_{1}),\ldots,\bm{\xi}(\tau_{s})\right){\rm{d}}\bm{\xi}(\tau_{1})\ldots{\rm{d}}\bm{\xi}(\tau_{s}) \tag{12a}\\
\text{subject to}\quad&\int_{\bm{\mathcal{X}}_{-\sigma}}\bm{M}\left(\bm{\xi}(\tau_{1}),\ldots,\bm{\xi}(\tau_{s})\right){\rm{d}}\bm{\xi}_{-\sigma}=\mu_{\sigma}\quad\forall\sigma\in\llbracket s\rrbracket. \tag{12b}
\end{align}

In particular, $\mathcal{M}(\bm{\mathcal{X}})$ is a convex set. The objective (12a) is strictly convex in $\bm{M}$, thanks to the $\varepsilon$-regularized negative entropy term $\int_{\bm{\mathcal{X}}}\varepsilon\bm{M}\log\bm{M}$. The constraints (12b) are linear.

In this work, the measures $\{\mu_{\sigma}\}_{\sigma\in\llbracket s\rrbracket}$ correspond to sequential observations, and we therefore fix the path structured (Fig. 1) ground cost
\[
\bm{C}\left(\bm{\xi}(\tau_{1}),\ldots,\bm{\xi}(\tau_{s})\right)=\sum_{\sigma=1}^{s-1}c_{\sigma}\left(\bm{\xi}(\tau_{\sigma}),\bm{\xi}(\tau_{\sigma+1})\right). \tag{13}
\]

In particular, we choose the squared Euclidean distance as the sequential cost between two consecutive snapshot indices, i.e., $c_{\sigma}(\cdot,\cdot):=\|\cdot-\cdot\|_{2}^{2}$ $\forall\sigma\in\llbracket s\rrbracket$. MSBPs with more general tree structured ground costs have appeared in [7].

When the cardinality of the index set $\llbracket s\rrbracket$ equals $2$, then (12) reduces to the (bimarginal) Schrödinger bridge problem (SBP) [8, 9]. In this case, the solution of (12) gives the most likely evolution between two marginal snapshots $\mu_{1},\mu_{2}$. This can be established via the large deviations [10] interpretation [11, Sec. II] of the SBP using Sanov's theorem [12]; see also [13, Sec. 2.1].

Specifically, let $\mathcal{C}\left([\tau_{1},\tau_{2}],\mathbb{R}^{d}\right)$ denote the collection of continuous functions on the time interval $[\tau_{1},\tau_{2}]$ taking values in $\mathbb{R}^{d}$. Let $\Pi(\mu_{1},\mu_{2})$ be the collection of all path measures on $\mathcal{C}\left([\tau_{1},\tau_{2}],\mathbb{R}^{d}\right)$ with time $\tau_{1}$ marginal $\mu_{1}$ and time $\tau_{2}$ marginal $\mu_{2}$. Given a symmetric ground cost (e.g., Euclidean distance) $C:\mathcal{X}_{1}\times\mathcal{X}_{2}\mapsto\mathbb{R}_{\geq 0}$, let
\[
K(\cdot,\cdot):=\exp\left(-\dfrac{C(\cdot,\cdot)}{\varepsilon}\right), \tag{14}
\]
and consider the bimarginal Gibbs kernel
\[
K\left(\bm{\xi}(\tau_{1}),\bm{\xi}(\tau_{2})\right)\,\mu_{1}\otimes\mu_{2}. \tag{15}
\]

Then the bimarginal SBP solves
\[
\underset{\pi\in\Pi(\mu_{1},\mu_{2})}{\min}\ \varepsilon D_{\rm{KL}}\left(\pi\,\|\,K\left(\bm{\xi}(\tau_{1}),\bm{\xi}(\tau_{2})\right)\mu_{1}\otimes\mu_{2}\right), \tag{16}
\]
i.e., it finds the most likely evolution of the path measure consistent with the observed measure-valued snapshots $\mu_{1},\mu_{2}$.

Under the stated assumptions on the ground cost $C$, the existence of a minimizer for (16) is guaranteed [14, 15]. The uniqueness of the minimizer follows from the strict convexity of the map $\pi\mapsto D_{\rm{KL}}(\pi\|\nu)$ for fixed $\nu$.

This relative entropy reformulation, and thereby the “most likely evolution consistent with observed measures” interpretation, also holds for the MSBP (12) with $s\geq 2$ snapshots. Specifically, for $\bm{C}:\bm{\mathcal{X}}\mapsto\mathbb{R}_{\geq 0}$ as in (12)-(13), we generalize (14) as
\[
\bm{K}\left(\bm{\xi}(\tau_{1}),\ldots,\bm{\xi}(\tau_{s})\right):=\exp\left(-\dfrac{\bm{C}\left(\bm{\xi}(\tau_{1}),\ldots,\bm{\xi}(\tau_{s})\right)}{\varepsilon}\right), \tag{17}
\]
and define the multimarginal Gibbs kernel
\[
\bm{K}\left(\bm{\xi}(\tau_{1}),\ldots,\bm{\xi}(\tau_{s})\right)\,\mu_{1}\otimes\ldots\otimes\mu_{s}. \tag{18}
\]

Problem (16) then generalizes to
\[
\underset{\pi\in\Pi(\mu_{1},\ldots,\mu_{s})}{\min}\ \varepsilon D_{\rm{KL}}\left(\pi\,\|\,\bm{K}\left(\bm{\xi}(\tau_{1}),\ldots,\bm{\xi}(\tau_{s})\right)\mu_{1}\otimes\ldots\otimes\mu_{s}\right), \tag{19}
\]
where $\Pi(\mu_{1},\ldots,\mu_{s})$ denotes the collection of all path measures on $\mathcal{C}\left([\tau_{1},\tau_{s}],\mathbb{R}^{d}\right)$ with time $\tau_{\sigma}$ marginal $\mu_{\sigma}$ $\forall\sigma\in\llbracket s\rrbracket$. The equivalence between (12) and (19) can be verified by direct computation. Thus solving (19), or equivalently (12), yields the most likely evolution of the path measure consistent with the observed measure-valued snapshots $\mu_{\sigma}$ $\forall\sigma\in\llbracket s\rrbracket$.

We propose to solve the MSBP (12) to learn the time-varying statistics of the hardware resource state $\bm{\xi}$ as in (10). We next detail a discrete formulation to numerically solve the same for scattered data $\{\bm{\xi}^{i}(\tau_{\sigma})\}_{i=1}^{n}$, where $n$ is the number of control software execution profiles.

The minimizer of (12), $\bm{M}_{\rm{opt}}\left(\bm{\xi}(\tau_{1}),\ldots,\bm{\xi}(\tau_{s})\right)$, can be used to compute the optimal coupling between snapshot index pairs $(\sigma_{1},\sigma_{2})\in\{\llbracket s\rrbracket^{\otimes 2}\mid\sigma_{1}<\sigma_{2}\}$ as
\[
\int_{\bm{\mathcal{X}}_{-\sigma_{1},-\sigma_{2}}}\bm{M}_{\rm{opt}}\left(\bm{\xi}(\tau_{1}),\ldots,\bm{\xi}(\tau_{s})\right){\rm{d}}\bm{\xi}_{-\sigma_{1},-\sigma_{2}}, \tag{20}
\]
where
\begin{align}
{\rm{d}}\bm{\xi}_{-\sigma_{1},-\sigma_{2}}&:=\prod_{\sigma\in\llbracket s\rrbracket\setminus\{\sigma_{1},\sigma_{2}\}}{\rm{d}}\bm{\xi}(\tau_{\sigma}), \tag{21a}\\
\bm{\mathcal{X}}_{-\sigma_{1},-\sigma_{2}}&:=\prod_{\sigma\in\llbracket s\rrbracket\setminus\{\sigma_{1},\sigma_{2}\}}\mathcal{X}_{\sigma}. \tag{21b}
\end{align}

This will be useful for predicting the statistics of $\bm{\xi}(\tau)\sim\mu_{\tau}$ at any (out-of-sample) query time $\tau\in[0,t]$.

Remark 1.

(MSBP and MOT) When the entropic regularization strength $\varepsilon=0$, the MSBP (12) reduces to the multimarginal optimal transport (MOT) problem [16, 17], which has found widespread applications in barycenter computation [18], fluid dynamics [19, 20], team matching problems [21], and density functional theory [22, 23]. Further specializing the MOT to the case where the cardinality of $\llbracket s\rrbracket$ equals $2$ yields the (bimarginal) optimal transport [24] problem.

III-D Discrete Formulation of MSBP

For finite scattered data $\{\bm{\xi}^{i}(\tau_{\sigma})\}_{i=1}^{n}$ and $\{\mu_{\sigma}\}_{\sigma\in\llbracket s\rrbracket}$ as in (9), we set up a discrete version of (12) as follows.

With a slight abuse of notation, we use the same symbol for the continuum and discrete versions of a tensor. The ground cost in the discrete formulation is represented by an order $s$ tensor $\bm{C}\in\left(\mathbb{R}^{n}\right)^{\otimes s}_{\geq 0}$ with components $\left[\bm{C}_{i_{1},\ldots,i_{s}}\right]=\bm{C}\left(\bm{\xi}_{i_{1}},\ldots,\bm{\xi}_{i_{s}}\right)$. The component $\left[\bm{C}_{i_{1},\ldots,i_{s}}\right]$ encodes the cost of transporting unit mass for a tuple $(i_{1},\ldots,i_{s})$.

Likewise, consider the discrete mass tensor $\bm{M}\in\left(\mathbb{R}^{n}\right)^{\otimes s}_{\geq 0}$ with components $\left[\bm{M}_{i_{1},\ldots,i_{s}}\right]=\bm{M}\left(\bm{\xi}_{i_{1}},\ldots,\bm{\xi}_{i_{s}}\right)$. The component $\left[\bm{M}_{i_{1},\ldots,i_{s}}\right]$ denotes the amount of mass transported for a tuple $(i_{1},\ldots,i_{s})$.

For any $\sigma\in\llbracket s\rrbracket$, the empirical marginals $\bm{\mu}_{\sigma}\in\mathbb{R}^{n}_{\geq 0}$ are supported on the finite sets $\{\bm{\xi}^{i}(\tau_{\sigma})\}_{i=1}^{n}$. We denote the projection of $\bm{M}\in\left(\mathbb{R}^{n}\right)^{\otimes s}_{\geq 0}$ on the $\sigma$th marginal as ${\rm{proj}}_{\sigma}(\bm{M})$. Thus ${\rm{proj}}_{\sigma}:\left(\mathbb{R}^{n}\right)^{\otimes s}_{\geq 0}\mapsto\mathbb{R}^{n}_{\geq 0}$ is given componentwise as
\[
\left[{\rm{proj}}_{\sigma}\left(\bm{M}\right)_{j}\right]=\sum_{i_{1},\ldots,i_{\sigma-1},i_{\sigma+1},\ldots,i_{s}}\bm{M}_{i_{1},\ldots,i_{\sigma-1},j,i_{\sigma+1},\ldots,i_{s}}. \tag{22}
\]

Likewise, denote the projection of $\bm{M}\in\left(\mathbb{R}^{n}\right)^{\otimes s}_{\geq 0}$ on the $(\sigma_{1},\sigma_{2})$th marginal as ${\rm{proj}}_{\sigma_{1},\sigma_{2}}(\bm{M})$, i.e., ${\rm{proj}}_{\sigma_{1},\sigma_{2}}:\left(\mathbb{R}^{n}\right)^{\otimes s}_{\geq 0}\mapsto\mathbb{R}^{n\times n}_{\geq 0}$, given componentwise as
\[
\left[{\rm{proj}}_{\sigma_{1},\sigma_{2}}\left(\bm{M}\right)_{j,\ell}\right]=\sum_{i_{\sigma}\mid\sigma\in\llbracket s\rrbracket\setminus\{\sigma_{1},\sigma_{2}\}}\bm{M}_{i_{1},\ldots,i_{\sigma_{1}-1},j,i_{\sigma_{1}+1},\ldots,i_{\sigma_{2}-1},\ell,i_{\sigma_{2}+1},\ldots,i_{s}}. \tag{23}
\]

We note that (22) and (23) are the discrete versions of the integrals in (12b) and (20), respectively.
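To build intuition for (22)-(23), the following brute-force NumPy illustration computes both projections for a small instance ($s=3$) by direct summation. This is for exposition only: the summations cost $\mathcal{O}(n^{s})$, which Sec. III-F avoids.

```python
import numpy as np

n = 4
rng = np.random.default_rng(1)
M = rng.random((n, n, n))            # an order-3 mass tensor, one axis per snapshot

proj_2 = M.sum(axis=(0, 2))          # (22): project on the 2nd marginal (axis index 1)
proj_13 = M.sum(axis=1)              # (23): project on the (1, 3) marginal pair

# Both projections conserve the total transported mass.
assert np.isclose(proj_2.sum(), M.sum()) and np.isclose(proj_13.sum(), M.sum())
```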

With the above notation in place, the discrete version of (12) becomes
\begin{align}
\underset{\bm{M}\in\left(\mathbb{R}^{n}\right)^{\otimes s}_{\geq 0}}{\min}~&\langle\bm{C}+\varepsilon\log\bm{M},\bm{M}\rangle \tag{24a}\\
\text{subject to}~~&{\rm{proj}}_{\sigma}\left(\bm{M}\right)=\bm{\mu}_{\sigma}\quad\forall\sigma\in\llbracket s\rrbracket. \tag{24b}
\end{align}

The primal formulation (24) has $n^{s}$ decision variables, and is computationally intractable. Recall that even for the bimarginal ($s=2$) case, a standard approach [25] is to use Lagrange duality to notice that the optimal mass matrix $M_{\rm{opt}}$ is a diagonal scaling of $K:=\exp(-C/\varepsilon)\in\mathbb{R}^{n\times n}_{>0}$, i.e., $M_{\rm{opt}}={\rm{diag}}(\bm{u}_{1})\,K\,{\rm{diag}}(\bm{u}_{2})$, where $\bm{u}_{1}:=\exp(\bm{\lambda}_{1}/\varepsilon)$, $\bm{u}_{2}:=\exp(\bm{\lambda}_{2}/\varepsilon)$, and $\bm{\lambda}_{1},\bm{\lambda}_{2}\in\mathbb{R}^{n}$ are the Lagrange multipliers associated with the respective bimarginal constraints ${\rm{proj}}_{1}(M)=\bm{\mu}_{1}$, ${\rm{proj}}_{2}(M)=\bm{\mu}_{2}$. The unknowns $\bm{u}_{1},\bm{u}_{2}$ can be obtained by performing the Sinkhorn iterations
\begin{align}
\bm{u}_{1}&\leftarrow\bm{\mu}_{1}\oslash\left(K\bm{u}_{2}\right), \tag{25a}\\
\bm{u}_{2}&\leftarrow\bm{\mu}_{2}\oslash\left(K^{\top}\bm{u}_{1}\right), \tag{25b}
\end{align}
with guaranteed linear convergence [26], wherein the computational cost per iteration is governed by two matrix-vector multiplications.
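A minimal NumPy sketch of the iterations (25), assuming uniform empirical marginals as in (9) and the Gibbs kernel (14) with squared Euclidean cost (the function name and stopping rule are illustrative assumptions):

```python
import numpy as np

def sinkhorn_bimarginal(C, mu1, mu2, eps=0.1, tol=1e-10, max_iter=20_000):
    """Bimarginal Sinkhorn (25); returns u1, u2 and M_opt = diag(u1) K diag(u2)."""
    K = np.exp(-C / eps)                                 # Gibbs kernel (14)
    u1, u2 = np.ones_like(mu1), np.ones_like(mu2)
    for _ in range(max_iter):
        u1 = mu1 / (K @ u2)                              # (25a)
        u2 = mu2 / (K.T @ u1)                            # (25b)
        if np.max(np.abs(u1 * (K @ u2) - mu1)) < tol:    # marginal feasibility residual
            break
    return u1, u2, u1[:, None] * K * u2[None, :]

# Example on synthetic scattered data with uniform marginals.
rng = np.random.default_rng(2)
n, d = 50, 3
X1, X2 = rng.random((n, d)), rng.random((n, d))
C = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)     # squared Euclidean cost
mu = np.full(n, 1.0 / n)
u1, u2, M = sinkhorn_bimarginal(C, mu, mu)
assert np.allclose(M.sum(axis=1), mu) and np.allclose(M.sum(axis=0), mu)
```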

The duality result extends to the multimarginal ($s\geq 2$) case. Specifically, the optimal mass tensor in (24) admits the structure $\bm{M}_{\rm{opt}}=\bm{K}\odot\bm{U}$, where $\bm{K}:=\exp(-\bm{C}/\varepsilon)\in\left(\mathbb{R}^{n}\right)^{\otimes s}_{>0}$, $\bm{U}:=\otimes_{\sigma=1}^{s}\bm{u}_{\sigma}\in\left(\mathbb{R}^{n}\right)^{\otimes s}_{>0}$, $\bm{u}_{\sigma}:=\exp(\bm{\lambda}_{\sigma}/\varepsilon)$, and $\bm{\lambda}_{\sigma}\in\mathbb{R}^{n}$ are the Lagrange multipliers associated with the respective multimarginal constraints (24b). The unknowns $\bm{u}_{\sigma}$ can, in principle, be obtained from the multimarginal Sinkhorn iterations [27]
\[
\bm{u}_{\sigma}\leftarrow\bm{u}_{\sigma}\odot\bm{\mu}_{\sigma}\oslash{\rm{proj}}_{\sigma}\left(\bm{K}\odot\bm{U}\right)\quad\forall\sigma\in\llbracket s\rrbracket, \tag{26}
\]
which generalize (25). However, computing ${\rm{proj}}_{\sigma}\left(\bm{K}\odot\bm{U}\right)$ naively requires $\mathcal{O}\left(n^{s}\right)$ operations. Before describing how to avoid this exponential complexity (Sec. III-F), we point out the convergence guarantees for (26).

III-E Convergence for Multimarginal Sinkhorn Iterations

The iterations (26) can be derived either as alternating Bregman projections [27] or via block coordinate dual ascent [6]. Either viewpoint leads to guaranteed linear convergence of (26); see [28], [7, Thm. 3.5].

More recent works have also established guaranteed convergence for the continuous formulation (12) [29], with a linear rate of convergence [30].

III-F Multimarginal Sinkhorn Iterations for Path Structured $\bm{C}$

We circumvent the exponential complexity of computing ${\rm{proj}}_{\sigma}\left(\bm{K}\odot\bm{U}\right)$ in (26) by leveraging the path structured ground cost (13). This is enabled by a key result from [6], rephrased and reproved below in a slightly generalized form.

Proposition 1.

([6, Prop. 2]) Consider the discrete ground cost tensor $\bm{C}$ in (24) induced by a path structured cost (13), so that $\left[\bm{C}_{i_{1},\ldots,i_{s}}\right]=\sum_{\sigma=1}^{s-1}\left[C_{i_{\sigma},i_{\sigma+1}}^{\sigma\rightarrow\sigma+1}\right]$, where the matrix $C^{\sigma\rightarrow\sigma+1}\in\mathbb{R}^{n\times n}_{\geq 0}$ encodes the cost of transporting unit mass between each source-destination pair from the source set $\{\bm{\xi}^{i}(\tau_{\sigma})\}_{i=1}^{n}$ to the destination set $\{\bm{\xi}^{i}(\tau_{\sigma+1})\}_{i=1}^{n}$.

Let $K^{\sigma\rightarrow\sigma+1}:=\exp(-C^{\sigma\rightarrow\sigma+1}/\varepsilon)\in\mathbb{R}^{n\times n}_{>0}$, $\bm{K}:=\exp(-\bm{C}/\varepsilon)\in\left(\mathbb{R}^{n}\right)^{\otimes s}_{>0}$, and $\bm{U}:=\otimes_{\sigma=1}^{s}\bm{u}_{\sigma}\in\left(\mathbb{R}^{n}\right)^{\otimes s}_{>0}$.

Then (22) and (23) can be expressed as
\[
{\rm{proj}}_{\sigma}\left(\bm{K}\odot\bm{U}\right)=\left(\bm{u}_{1}^{\top}K^{1\to 2}\prod_{j=2}^{\sigma-1}{\rm{diag}}(\bm{u}_{j})K^{j\to j+1}\right)^{\top}\odot\,\bm{u}_{\sigma}\odot\left(\left(\prod_{j=\sigma+1}^{s-1}K^{j-1\to j}{\rm{diag}}(\bm{u}_{j})\right)K^{s-1\to s}\bm{u}_{s}\right)\quad\forall\sigma\in\llbracket s\rrbracket, \tag{27}
\]
and
\[
{\rm{proj}}_{\sigma_{1},\sigma_{2}}\left(\bm{K}\odot\bm{U}\right)={\rm{diag}}\left(\bm{u}_{1}^{\top}K^{1\to 2}\prod_{j=2}^{\sigma_{1}-1}{\rm{diag}}(\bm{u}_{j})K^{j\to j+1}\right)\,{\rm{diag}}(\bm{u}_{\sigma_{1}})\prod_{j=\sigma_{1}+1}^{\sigma_{2}}\left(K^{j-1\to j}{\rm{diag}}(\bm{u}_{j})\right)\,{\rm{diag}}\left(\left(\prod_{j=\sigma_{2}+1}^{s-1}K^{j-1\to j}{\rm{diag}}(\bm{u}_{j})\right)K^{s-1\to s}\bm{u}_{s}\right)\quad\forall(\sigma_{1},\sigma_{2})\in\{\llbracket s\rrbracket^{\otimes 2}\mid\sigma_{1}<\sigma_{2}\}. \tag{28}
\]
Proof.

The proof strategy is to write the Hilbert-Schmidt inner product $\langle\bm{K},\bm{U}\rangle$ in two different ways.

First, recall that $\bm{K}:=\exp(-\bm{C}/\varepsilon)\in\left(\mathbb{R}^{n}\right)^{\otimes s}_{>0}$ and $\bm{U}:=\otimes_{\sigma=1}^{s}\bm{u}_{\sigma}\in\left(\mathbb{R}^{n}\right)^{\otimes s}_{>0}$. So following (1), we have
\begin{align*}
\langle\bm{K},\bm{U}\rangle&=\sum_{i_{1},\dots,i_{s}}\left(\prod_{j=2}^{s}\left[K^{j-1\to j}_{i_{j-1},i_{j}}\right]\right)\prod_{j=1}^{s}(\bm{u}_{j})_{i_{j}}\\
&=\sum_{i_{1},\dots,i_{s}}(\bm{u}_{1})_{i_{1}}\prod_{j=2}^{s}\left[K^{j-1\to j}{\rm{diag}}(\bm{u}_{j})\right]_{i_{j-1},i_{j}}\\
&=\bm{u}_{1}^{\top}\left(\prod_{j=2}^{s-1}K^{j-1\to j}{\rm{diag}}(\bm{u}_{j})\right)K^{s-1\to s}\bm{u}_{s},
\end{align*}
and (27) follows from [6, Lemma 1 in Appendix 1].

Next, notice that we can alternatively write
\begin{align*}
\langle\bm{K},\bm{U}\rangle=\;&\bm{u}_{1}^{\top}\left(\prod_{j=2}^{\sigma_{1}-1}K^{j-1\to j}{\rm{diag}}(\bm{u}_{j})\right)K^{\sigma_{1}-1\to\sigma_{1}}\left(\prod_{j=\sigma_{1}}^{\sigma_{2}-1}{\rm{diag}}(\bm{u}_{j})K^{j\to j+1}\right){\rm{diag}}(\bm{u}_{\sigma_{2}})\left(\prod_{j=\sigma_{2}+1}^{s-1}K^{j-1\to j}{\rm{diag}}(\bm{u}_{j})\right)K^{s-1\to s}\bm{u}_{s}.
\end{align*}
Then (28) follows from [6, Lemma 2 in Appendix 1]. ∎

Remark 2.

Unlike [6, Prop. 2], our data $\{\bm{\xi}^{i}(\tau_{\sigma})\}_{i=1}^{n}$ are scattered, i.e., not on a fixed grid, hence the need for the superscripts $\sigma\rightarrow\sigma+1$ on the time-varying matrices in our Prop. 1. In contrast, the corresponding matrices in [6, Prop. 2] are independent of $\sigma$.

Remark 3.

We note that substituting (27) into (26) cancels the (elements of the) positive vectors $\bm{u}_{\sigma}$ $\forall\sigma\in\llbracket s\rrbracket$ from the corresponding numerators and denominators. This further simplifies our multimarginal Sinkhorn recursions to
\[
\bm{u}_{\sigma}\leftarrow\bm{\mu}_{\sigma}\oslash\left(\left(\bm{u}_{1}^{\top}K^{1\to 2}\prod_{j=2}^{\sigma-1}{\rm{diag}}(\bm{u}_{j})K^{j\to j+1}\right)^{\top}\odot\left(\left(\prod_{j=\sigma+1}^{s-1}K^{j-1\to j}{\rm{diag}}(\bm{u}_{j})\right)K^{s-1\to s}\bm{u}_{s}\right)\right)\quad\forall\sigma\in\llbracket s\rrbracket. \tag{29}
\]
Remark 4.

(From exponential to linear complexity in $s$) We note that (29) involves $s-1$ matrix-vector multiplications, each of $\mathcal{O}(n^{2})$ complexity. So the computational complexity of (29) becomes $\mathcal{O}\left((s-1)n^{2}\right)$, which is linear in $s$, i.e., a significant reduction from the earlier $\mathcal{O}\left(n^{s}\right)$ complexity mentioned at the end of Sec. III-D.

Remark 5.

(Linear complexity in $d$) The dimension $d$ of the vector $\bm{\xi}$ only affects the construction of the time-varying squared Euclidean distance matrices $C^{\sigma\rightarrow\sigma+1}$ $\forall\sigma\in\llbracket s-1\rrbracket$ in Prop. 1, whose total cost of $\mathcal{O}\left((s-1)n^{2}d\right)$ is linear in $d$. Once constructed, the recursions (29) are independent of $d$.
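Putting Prop. 1 and Remarks 3-5 together, the following sketch implements the recursions (29); the function names are illustrative, not the authors' MATLAB implementation. For clarity, the left and right chains are recomputed from scratch, making one full sweep $\mathcal{O}(s^{2}n^{2})$; caching the partial matrix-vector products recovers the $\mathcal{O}\left((s-1)n^{2}\right)$ per-sweep cost of Remark 4.

```python
import numpy as np

def hilbert_metric(u, v):
    """Hilbert projective metric (3), used as the convergence criterion (cf. Fig. 5)."""
    r = u / v
    return np.log(r.max() / r.min())

def msbp_sinkhorn_path(Ks, mus, max_iter=5000, tol=1e-9):
    """Multimarginal Sinkhorn recursions (29) for a path structured cost.
    Ks[a] is the n-by-n Gibbs kernel K^{a->a+1} (0-indexed); mus[a] are the marginals."""
    s, n = len(mus), len(mus[0])
    u = [np.ones(n) for _ in range(s)]

    def left(a):   # transpose of u_1^T K^{1->2} diag(u_2) ... K^{a-1->a}, cf. (27)
        phi = np.ones(n)
        for j in range(a):
            phi = Ks[j].T @ (u[j] * phi)
        return phi

    def right(a):  # (prod_j K^{j-1->j} diag(u_j)) K^{s-1->s} u_s, cf. (27)
        psi = np.ones(n)
        for j in range(s - 2, a - 1, -1):
            psi = Ks[j] @ (u[j + 1] * psi)
        return psi

    for _ in range(max_iter):
        u_prev = [x.copy() for x in u]
        for a in range(s):
            u[a] = mus[a] / (left(a) * right(a))    # the update (29)
        if max(hilbert_metric(u[a], u_prev[a]) for a in range(s)) < tol:
            break
    return u
```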

We next outline how the solution tensor $\bm{M}_{\rm{opt}}=\bm{K}\odot\bm{U}$, obtained from the converged Sinkhorn iterations, can be used together with (28) to make stochastic predictions of the most likely hardware resource state in the form (10).

III-G Predicting Most Likely Distribution

For the ground cost (13) resulting from the sequential information structure (Fig. 1), we utilize (28) to decompose $\bm{M}_{\rm{opt}}=\bm{K}\odot\bm{U}$ of (24) into the bimarginal transport plans
\[
M^{\sigma_{1}\to\sigma_{2}}:={\rm{proj}}_{\sigma_{1},\sigma_{2}}(\bm{M}_{\rm{opt}})={\rm{proj}}_{\sigma_{1},\sigma_{2}}(\bm{K}\odot\bm{U}). \tag{30}
\]

Further, when $\bm{C}$ is squared Euclidean, as considered here, the maximum likelihood estimate for $\mu_{\tau}$ in (10) at a query time $\tau\in[0,t]$ is (see [6, Sec. 2.2])
\[
\hat{\mu}_{\tau}:=\sum_{i=1}^{n}\sum_{j=1}^{n}\left[M^{\sigma\to\sigma+1}_{i,j}\right]\delta(\bm{\xi}-\hat{\bm{\xi}}(\tau,\bm{\xi}^{i}(\tau_{\sigma}),\bm{\xi}^{j}(\tau_{\sigma+1}))), \tag{31}
\]
where $\sigma\in\llbracket s\rrbracket$ is such that $\tau\in[\tau_{\sigma},\tau_{\sigma+1}]$, and
\begin{align}
\hat{\bm{\xi}}(\tau,\bm{\xi}^{i}(\tau_{\sigma}),\bm{\xi}^{j}(\tau_{\sigma+1}))&:=(1-\lambda)\bm{\xi}^{i}(\tau_{\sigma})+\lambda\bm{\xi}^{j}(\tau_{\sigma+1}), \tag{32a}\\
\lambda&:=\dfrac{\tau-\tau_{\sigma}}{\tau_{\sigma+1}-\tau_{\sigma}}\in[0,1]. \tag{32b}
\end{align}
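A sketch of the prediction step (30)-(32), reusing the converged scaling vectors $\bm{u}_{\sigma}$ and kernels $K^{\sigma\to\sigma+1}$ from the Sinkhorn sketch above (the names and array conventions are ours):

```python
import numpy as np

def pairwise_plan(a, u, Ks):
    """Consecutive bimarginal plan M^{a->a+1} = proj_{a,a+1}(K ⊙ U) via (28), (30)."""
    s, n = len(u), len(u[0])
    phi = np.ones(n)                    # left chain over snapshots before a
    for j in range(a):
        phi = Ks[j].T @ (u[j] * phi)
    psi = np.ones(n)                    # right chain over snapshots after a+1
    for j in range(s - 2, a, -1):
        psi = Ks[j] @ (u[j + 1] * psi)
    return ((phi * u[a])[:, None] * Ks[a]) * (u[a + 1] * psi)[None, :]

def predict(tau, taus, a, u, Ks, xi):
    """Most likely distribution (31)-(32) at a query time tau in [taus[a], taus[a+1]].
    Returns (n*n)-by-d support points and the matching (n*n,) probability weights."""
    lam = (tau - taus[a]) / (taus[a + 1] - taus[a])                     # (32b)
    pts = (1 - lam) * xi[a][:, None, :] + lam * xi[a + 1][None, :, :]   # (32a)
    return pts.reshape(-1, xi[a].shape[1]), pairwise_plan(a, u, Ks).ravel()
```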

IV Overall Algorithm

The methodology proposed in Sec. III comprises the following three overall steps.
Step 1. Given a collection of contexts (Sec. III-A) $\{\bm{c}^{i}\}_{i=1}^{n_{\rm{context}}}$, execute the control software over $[0,t]$ to generate hardware resource state sample snapshots (Sec. III-B) $\{\bm{\xi}^{i}(\tau_{\sigma})\}_{i=1}^{n}$, and thereby the empirical $\mu_{\sigma}$ as in (9) for all $\sigma\in\llbracket s\rrbracket$, conditioned on each of the $n_{\rm{context}}$ context samples.
Step 2. Using the data from Step 1, construct the squared Euclidean distance matrices $C^{\sigma\rightarrow\sigma+1}$ from the source set $\{\bm{\xi}^{i}(\tau_{\sigma})\}_{i=1}^{n}$ to the destination set $\{\bm{\xi}^{i}(\tau_{\sigma+1})\}_{i=1}^{n}$ $\forall\sigma\in\llbracket s-1\rrbracket$. Perform the recursions (29) until convergence (error within desired tolerance).
Step 3. Given a query context $\bm{c}$ and time $\tau\in[0,t]$, return the most likely distribution $\hat{\mu}_{\tau}$ using (31).

Remark 6.

For the three steps mentioned above, Step 1 is data generation, Step 2 is probabilistic learning using data from Step 1, and Step 3 is prediction using the learnt model.
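A hypothetical end-to-end driver tying the three steps together, reusing msbp_sinkhorn_path and predict from the sketches in Sec. III; synthetic random arrays stand in for the Step 1 profiling measurements.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, s, eps = 100, 3, 6, 0.1
taus = np.sort(rng.random(s))                            # snapshot times tau_1 < ... < tau_s
xi = [rng.random((n, d)) + 0.1 * a for a in range(s)]    # Step 1 stand-in: scattered snapshots
mus = [np.full(n, 1.0 / n) for _ in range(s)]            # empirical marginals (9)

# Step 2: Gibbs kernels from squared Euclidean costs, then Sinkhorn recursions (29).
Ks = [np.exp(-((xi[a][:, None, :] - xi[a + 1][None, :, :]) ** 2).sum(-1) / eps)
      for a in range(s - 1)]
u = msbp_sinkhorn_path(Ks, mus)

# Step 3: most likely distribution (31) at a query time inside [taus[2], taus[3]].
tau = 0.5 * (taus[2] + taus[3])
pts, wts = predict(tau, taus, 2, u, Ks, xi)
assert np.isclose(wts.sum(), 1.0, atol=1e-3)             # weights form a probability vector
```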

Figure 2: All 12 paths used in profiling the NMPC software (Sec. V), generated by GP sampling via Scikit-learn [31] over the domain $[0,10]$ using mean zero and variance 10.

V Numerical Case Study

In this section, we illustrate the application of the proposed method to vehicle path tracking control software. Throughout, we provide details for the three steps in Sec. IV.

Control Software. We wrote custom software¹ in the C language implementing a path-following nonlinear model predictive controller (NMPC) for a kinematic bicycle model (KBM) [32, 33] of a vehicle with four states $(x,y,v,\psi)$ and two control inputs $(a_{c},\delta)$, given by
\[
\dot{x}=v\cos(\psi+\beta),\quad\dot{y}=v\sin(\psi+\beta),\quad\dot{v}=a_{c},\quad\dot{\psi}=\frac{v}{\ell_{\text{rear}}}\sin\beta,
\]
where the sideslip angle $\beta=\arctan\left(\frac{\ell_{\text{rear}}}{\ell_{\text{front}}+\ell_{\text{rear}}}\tan\delta\right)$. The $4\times 1$ state vector comprises the inertial position $(x,y)$ of the vehicle's center-of-mass, its speed $v$, and the vehicle's inertial heading angle $\psi$. The $2\times 1$ control vector comprises the acceleration $a_{c}$ and the front steering wheel angle $\delta$.
¹Git repo: https://github.com/abhishekhalder/CPS-Frontier-Task3-Collaboration

Figure 3: Components of the measured feature vector $\bm{\xi}$ in (7) for all five control cycles across 500 executions of the NMPC software, where $\bm{c}=[15,15,y^{1}_{\rm{des}}(x)]$.

The parameters $\ell_{\text{front}},\ell_{\text{rear}}$ are the distances of the vehicle's center-of-mass to the front and rear axles, respectively.
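For illustration, one explicit-Euler integration step of the KBM dynamics above could be coded as follows; the axle distances and step size here are illustrative assumptions, not the values used in the case study.

```python
import numpy as np

L_FRONT, L_REAR = 1.2, 1.6     # center-of-mass to front/rear axle distances (m), assumed

def kbm_step(state, a_c, delta, dt=0.01):
    """One Euler step of the kinematic bicycle model; state = (x, y, v, psi)."""
    x, y, v, psi = state
    beta = np.arctan(L_REAR / (L_FRONT + L_REAR) * np.tan(delta))   # sideslip angle
    return np.array([
        x + dt * v * np.cos(psi + beta),
        y + dt * v * np.sin(psi + beta),
        v + dt * a_c,
        psi + dt * (v / L_REAR) * np.sin(beta),
    ])
```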

The NMPC was designed to track a desired path given as a sequence of $N=200$ waypoint tuples $\big\{(x_{d}^{(i)},y_{d}^{(i)},v_{d}^{(i)})\big\}_{i=1}^{N}$, i.e., a sequence of desired positions and speeds (the desired speed profile was numerically estimated from the desired waypoint profiles). At every control step (at most every 100 ms), using the IPOPT nonlinear program solver [34], the NMPC solved an optimization problem to minimize the sum of the crosstrack, $\psi$, and $v$ errors, along with the magnitude and rapidity of change of the control inputs, over a period of time from $0$ to the time horizon $H_{p}=4$, subject to control magnitude and slew rate constraints. For formulation details and the control performance achieved, see [35]. For implementation and parameter values, we refer the readers to the Git repository in the footnote.

While closing the loop for the KBM with the computed control values requires minimal computational overhead, the NMPC itself is computationally demanding. In the case where multiple vehicle controllers are available, it is of practical interest to predict the hardware resource usage of the NMPC over one to several control cycles, conditioned on the CPS context $\bm{c}$ (Sec. III-A) at a given time. For this we ‘profile’ the NMPC, meaning we run the software many times for different values of $\bm{c}$ as in (4), measuring the time evolution of the hardware resource state $\bm{\xi}$ as in (7). We use these profiles to generate the marginals $\mu_{\sigma}$ as in (9), thus completing Step 1 in Sec. IV.

We next provide details on generating control software execution profiles for our specific case study.
Generating Execution Profiles. To gather the execution profiles of our NMPC control software, we used an Ubuntu 16.04.7 Linux machine with an Intel Xeon E5-2683 v4 CPU. We leveraged Intel's Cache Allocation Technology (CAT) [36] and Memguard [37] to control the allocation of LLC partitions and of the memory bandwidth available to the control software, respectively. Both LLC partitions and memory bandwidth were allocated in blocks of 2 MB.

Utilizing these resource partitioning mechanisms, we ran our application on an isolated CPU and used the Linux perf tool [38], version 4.9.3, to sample $\bm{\xi}$ every 10 ms.

For each run of our application, we set the cache and memory bandwidth to a static allocation and passed as input a path for the NMPC to follow, represented as an array of desired $(x,y)$ coordinates. We then executed the control software for $n_{c}:=5$ uninterrupted “control cycles”, wherein the NMPC gets the KBM state, makes a control decision, and updates the KBM state.

We profiled 12 unique desired paths to track, denoted $\{y_{\rm{des}}^{i}(x)\}_{i=1}^{12}$, and 5 unique vectors $\{\bm{c}_{\rm{cyber}}^{i}\}_{i=1}^{5}$, comprising $n_{\rm{context}}=12\times 5=60$ samples of $\bm{c}$. Conditioned on each of these $60$ context samples $\{\bm{c}^{i}\}_{i=1}^{n_{\rm{context}}=60}$, we ran the software 500 times, for a total of 30,000 profiles.

The sample paths $\{y_{\rm{des}}^{i}(x)\}_{i=1}^{12}$ in (6) were all generated for $x\in[0,10]$ using a GP with mean zero and variance 10 [31], and are shown in Fig. 2.
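A sketch of this sampling with scikit-learn follows; the paper specifies only a zero-mean GP with variance 10 over $[0,10]$, so the RBF kernel choice, its length scale, and the discretization below are our assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

x = np.linspace(0.0, 10.0, 200).reshape(-1, 1)
kernel = ConstantKernel(constant_value=10.0) * RBF(length_scale=1.0)  # variance 10
gp = GaussianProcessRegressor(kernel=kernel)           # unfitted => zero-mean prior
y_des = gp.sample_y(x, n_samples=12, random_state=0)   # 200 x 12 array of sample paths
```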

Our vector samples $\{\bm{c}_{\rm{cyber}}^{i}\}_{i=1}^{5}$ in (5) were $[1,1]^{\top}$, $[5,5]^{\top}$, $[10,10]^{\top}$, $[15,15]^{\top}$, and $[20,20]^{\top}$, where each entry represents the number of cache/memory bandwidth partitions, ranging from $1$ to $20$. These values were selected to broadly cover the range of possible hardware contexts.

Figure 4: Normalized histograms (gray filled) and kernel density estimates (KDEs) (solid line) for the end times of all five control cycles across 500 executions of the NMPC software, conditioned on a fixed CPS context $\bm{c}$ shown above. The KDEs used a Gaussian kernel with bandwidths computed via cross validation [39, 40].
Control cycle | Mean | Standard deviation
#1 | 0.1181 | 0.0076
#2 | 0.2336 | 0.0106
#3 | 0.3495 | 0.0127
#4 | 0.4660 | 0.0143
#5 | 0.5775 | 0.0159
TABLE I: The means and standard deviations of the end times for the $n_{c}=5$ control cycles data shown in Fig. 4.
Figure 5: Linear convergence of the Sinkhorn iterations (29) for $s_{\rm{int}}=4$ w.r.t. the Hilbert projective metric $d_{\rm{H}}$ in (3) between $\bm{u}_{\sigma}$, $\sigma\in\llbracket s\rrbracket$, at iteration indices $k$ and $k-1$.
Figure 6: Predicted $\hat{\mu}_{\hat{\tau}_{j}}$ (blue) vs. measured $\mu_{\hat{\tau}_{j}}$ (red) at times $\hat{\tau}_{j}$, $j\in\llbracket 5\rrbracket$, during the 3rd control cycle with $s_{\rm{int}}=4$. Distributions at the control cycle boundaries are in black.
$s_{\rm{int}}$ | $W_{1}$ | $W_{2}$ | $W_{3}$ | $W_{4}$ | $W_{5}$
0 | 2.0489 | - | - | - | -
1 | 2.2695 | 1.1750 | - | - | -
2 | 5.7717 | 0.9163 | 0.3794 | - | -
3 | 2.2413 | 1.6432 | 1.2345 | 0.6010 | -
4 | 0.6372 | 1.2691 | 0.9176 | 0.6689 | 0.2111
TABLE II: Number of intracycle marginals $s_{\rm{int}}$ vs. Wasserstein distances $W_{j}$ as in (33). All entries are scaled up by $10^{4}$.

Applying the Proposed Algorithm. Given a query context $\bm{c}$, we determine the closest CPS context for which profiling data are available, using the Euclidean distance between cyber context vectors (5) and the Fréchet distance [41] between physical context curves (6). In this case study, we consider a query context whose closest cyber context is $\bm{c}_{\rm{cyber}}=[15,15]^{\top}$ and closest physical context is $\bm{c}_{\rm{phys}}=y_{\rm{des}}^{1}(x)$. Profiling data for this $\bm{c}$ are shown in Fig. 3.

For each of the $n=500$ profiles, we are given the end times of each of the $n_{c}=5$ control cycles. We then determine the statistics of the cycle end times (Fig. 4) and compute the empirical distributions of $\bm{\xi}$ at the means (Table I) of the control cycle start/end time boundaries. For the empirical distributions at times between cycle boundaries, we let $s_{\rm{int}}$ be the number of marginals equispaced in time between consecutive cycle boundaries. We then set $\tau_{\sigma}$, $\sigma\in\llbracket s\rrbracket$, from the means in Table I, where $s:=1+n_{c}(s_{\rm{int}}+1)$ and $\tau_{\sigma(s_{\rm{int}}+1)+1}$ is the sample mean end time of the $\sigma$th control cycle.

Our distributions are as in (9), where $\bm{\xi}^{i}(\tau_{\sigma})$ is the sample of the hardware resource state (7) at time $\approx\tau_{\sigma}$ (within 5 ms) for profile $i$, given the context $\bm{c}$.

We set $\varepsilon=0.1$ and solve the discrete MSBP (24) with the squared Euclidean cost $\bm{C}$ using (29). Fig. 5 shows that the Sinkhorn iterations converge linearly (Sec. III-E). We emphasize that the computational cost of the proposed algorithm is minimal, thanks to the path structure of the information. In particular, we solve the MSBP (24) with $n^{s}=500^{26}$ decision variables in approx. 10 s in MATLAB on an Ubuntu 22.04.2 LTS Linux machine with an AMD Ryzen 7 5800X CPU.

Fig. 6 compares the predicted versus the observed empirical distributions. Specifically, Fig. 6 shows $s_{\rm{int}}+1=5$ distributional predictions $\hat{\mu}_{\hat{\tau}_{j}}$ at times $\hat{\tau}_{j}$, temporally equispaced throughout the duration of the 3rd control cycle, i.e., between $\tau_{2(s_{\rm{int}}+1)+1}$ and $\tau_{3(s_{\rm{int}}+1)+1}$, with
\[
\hat{\tau}_{j}=\tau_{2(s_{\rm{int}}+1)+1}+\Bigg(\frac{\tau_{3(s_{\rm{int}}+1)+1}-\tau_{2(s_{\rm{int}}+1)+1}}{s_{\rm{int}}+2}\Bigg)j,
\]
where $j\in\llbracket s_{\rm{int}}+1\rrbracket$. We used (31) with $\sigma=2(s_{\rm{int}}+1)+j$, since $\hat{\tau}_{j}\in[\tau_{2(s_{\rm{int}}+1)+j},\tau_{2(s_{\rm{int}}+1)+j+1}]$.

From Fig. 6 it is clear that the measure-valued predictions, while largely accurate, are prone to error in cases where the software resource usage changes in bursts too short to appear in our observations. It follows that increasing the number of snapshots should improve the overall accuracy. In this example, we achieve this by increasing $s_{\rm{int}}$. Table II reports the Wasserstein distances between the corresponding predicted and measured distributions:
\[
W_{j}:=W(\hat{\mu}_{\hat{\tau}_{j}},\mu_{\hat{\tau}_{j}})\quad\forall j\in\llbracket s_{\rm{int}}+1\rrbracket. \tag{33}
\]

We computed each $W_{j}$ as the square root of the optimal value of the corresponding linear program that results from specializing (24) to $s=2$, $\varepsilon=0$.
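A sketch of this computation via SciPy's LP solver; the function name and conventions are ours.

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein2(X, Y, mu, nu):
    """2-Wasserstein distance between empirical measures (weights mu on rows of X,
    nu on rows of Y): the square root of the optimal value of the bimarginal OT
    linear program, i.e., (24) with s = 2 and eps = 0."""
    n, m = len(mu), len(nu)
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared Euclidean cost
    A_eq = np.zeros((n + m, n * m))                      # marginal (mass balance) constraints
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0                 # row sums of M equal mu
    for j in range(m):
        A_eq[n + j, j::m] = 1.0                          # column sums of M equal nu
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([mu, nu]),
                  bounds=(0, None), method="highs")
    return np.sqrt(res.fun)

# Example: distance between two small empirical measures.
rng = np.random.default_rng(4)
X, Y = rng.random((20, 3)), rng.random((20, 3))
w = np.full(20, 1.0 / 20)
print(wasserstein2(X, Y, w, w))
```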

VI Concluding Remarks

We apply recent algorithmic advances in solving the MSBP to learn stochastic hardware resource usage by control software. The learnt model demonstrates accurate nonparametric measure-valued predictions for the joint hardware resource state at a desired time, conditioned on the CPS context. The formulation and its solution come with a maximum likelihood guarantee in the space of probability measures, and the algorithm enjoys a guaranteed linear rate of convergence.

References

  • [1] G. Bernat, A. Colin, and S. Petters, “WCET analysis of probabilistic hard real-time systems,” in 23rd IEEE Real-Time Systems Symposium (RTSS), 2002, pp. 279–288.
  • [2] M. Lv, N. Guan, Y. Zhang, Q. Deng, G. Yu, and J. Zhang, “A survey of WCET analysis of real-time operating systems,” in 2009 International Conference on Embedded Software and Systems, 2009, pp. 65–72.
  • [3] R. Gifford, N. Gandhi, L. T. X. Phan, and A. Haeberlen, “DNA: Dynamic resource allocation for soft real-time multicore systems,” in 2021 IEEE 27th Real-Time and Embedded Technology and Applications Symposium (RTAS).   IEEE, 2021, pp. 196–209.
  • [4] K. Zhang, J. Sprinkle, and R. G. Sanfelice, “Computationally aware switching criteria for hybrid model predictive control of cyber-physical systems,” IEEE Transactions on Automation Science and Engineering, vol. 13, no. 2, pp. 479–490, 2016.
  • [5] P. J. Bushell, “Hilbert’s metric and positive contraction mappings in a Banach space,” Archive for Rational Mechanics and Analysis, vol. 52, pp. 330–338, 1973.
  • [6] F. Elvander, I. Haasler, A. Jakobsson, and J. Karlsson, “Multi-marginal optimal transport using partial information with applications in robust localization and sensor fusion,” Signal Processing, vol. 171, p. 107474, 2020.
  • [7] I. Haasler, A. Ringh, Y. Chen, and J. Karlsson, “Multimarginal optimal transport with a tree-structured cost and the Schrödinger bridge problem,” SIAM Journal on Control and Optimization, vol. 59, no. 4, pp. 2428–2453, 2021.
  • [8] C. Léonard, “A survey of the Schrödinger problem and some of its connections with optimal transport,” Discrete and Continuous Dynamical Systems-Series A, vol. 34, no. 4, pp. 1533–1574, 2014.
  • [9] Y. Chen, T. T. Georgiou, and M. Pavon, “Stochastic control liaisons: Richard Sinkhorn meets Gaspard Monge on a Schrödinger bridge,” SIAM Review, vol. 63, no. 2, pp. 249–313, 2021.
  • [10] A. Dembo and O. Zeitouni, Large deviations techniques and applications.   Springer Science & Business Media, 2009, vol. 38.
  • [11] H. Föllmer, “Random fields and diffusion processes,” École d'Été de Probabilités de Saint-Flour XV–XVII, 1985–87, 1988.
  • [12] I. N. Sanov, On the probability of large deviations of random variables.   United States Air Force, Office of Scientific Research, 1958.
  • [13] M. Pavon, G. Trigila, and E. G. Tabak, “The data-driven Schrödinger bridge,” Communications on Pure and Applied Mathematics, vol. 74, no. 7, pp. 1545–1573, 2021.
  • [14] I. Csiszár, “I-divergence geometry of probability distributions and minimization problems,” The annals of probability, pp. 146–158, 1975.
  • [15] J. M. Borwein, A. S. Lewis, and R. D. Nussbaum, “Entropy minimization, DAD problems, and doubly stochastic kernels,” Journal of Functional Analysis, vol. 123, no. 2, pp. 264–307, 1994.
  • [16] L. Rüschendorf and L. Uckelmann, “On the n-coupling problem,” Journal of multivariate analysis, vol. 81, no. 2, pp. 242–258, 2002.
  • [17] B. Pass, “Multi-marginal optimal transport: theory and applications,” ESAIM: Mathematical Modelling and Numerical Analysis-Modélisation Mathématique et Analyse Numérique, vol. 49, no. 6, pp. 1771–1790, 2015.
  • [18] M. Agueh and G. Carlier, “Barycenters in the Wasserstein space,” SIAM Journal on Mathematical Analysis, vol. 43, no. 2, pp. 904–924, 2011.
  • [19] Y. Brenier, “Generalized solutions and hydrostatic approximation of the Euler equations,” Physica D: Nonlinear Phenomena, vol. 237, no. 14-17, pp. 1982–1988, 2008.
  • [20] J.-D. Benamou, G. Carlier, and L. Nenna, “Generalized incompressible flows, multi-marginal transport and sinkhorn algorithm,” Numerische Mathematik, vol. 142, pp. 33–54, 2019.
  • [21] G. Carlier, A. Oberman, and E. Oudet, “Numerical methods for matching for teams and Wasserstein barycenters,” ESAIM: Mathematical Modelling and Numerical Analysis, vol. 49, no. 6, pp. 1621–1642, 2015.
  • [22] G. Buttazzo, L. De Pascale, and P. Gori-Giorgi, “Optimal-transport formulation of electronic density-functional theory,” Physical Review A, vol. 85, no. 6, p. 062502, 2012.
  • [23] C. Cotar, G. Friesecke, and C. Klüppelberg, “Density functional theory and optimal transportation with Coulomb cost,” Communications on Pure and Applied Mathematics, vol. 66, no. 4, pp. 548–599, 2013.
  • [24] G. Peyré and M. Cuturi, “Computational optimal transport: With applications to data science,” Foundations and Trends® in Machine Learning, vol. 11, no. 5-6, pp. 355–607, 2019.
  • [25] M. Cuturi, “Sinkhorn distances: Lightspeed computation of optimal transport,” Advances in neural information processing systems, vol. 26, 2013.
  • [26] J. Franklin and J. Lorenz, “On the scaling of multidimensional matrices,” Linear Algebra and its applications, vol. 114, pp. 717–735, 1989.
  • [27] J.-D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré, “Iterative Bregman projections for regularized transportation problems,” SIAM Journal on Scientific Computing, vol. 37, no. 2, pp. A1111–A1138, 2015.
  • [28] H. H. Bauschke and A. S. Lewis, “Dykstras algorithm with Bregman projections: A convergence proof,” Optimization, vol. 48, no. 4, pp. 409–427, 2000.
  • [29] S. D. Marino and A. Gerolin, “An optimal transport approach for the Schrödinger bridge problem and convergence of Sinkhorn algorithm,” Journal of Scientific Computing, vol. 85, no. 2, p. 27, 2020.
  • [30] G. Carlier, “On the linear convergence of the multimarginal Sinkhorn algorithm,” SIAM Journal on Optimization, vol. 32, no. 2, pp. 786–794, 2022.
  • [31] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
  • [32] J. Kong, M. Pfeiffer, G. Schildbach, and F. Borrelli, “Kinematic and dynamic vehicle models for autonomous driving control design,” in 2015 IEEE intelligent vehicles symposium (IV).   IEEE, 2015, pp. 1094–1099.
  • [33] S. Haddad, A. Halder, and B. Singh, “Density-based stochastic reachability computation for occupancy prediction in automated driving,” IEEE Transactions on Control Systems Technology, vol. 30, no. 6, pp. 2406–2419, 2022.
  • [34] A. Wächter and L. T. Biegler, “On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming,” Mathematical Programming, vol. 106, no. 1, pp. 25–57, 2006.
  • [35] https://github.com/abhishekhalder/CPS-Frontier-Task3-Collaboration/blob/master/Codes/kbm_sim/Documentation_KinematicBicycle_Controllers.pdf, accessed: 2023-09-29.
  • [36] Intel Corporation, “Improving real-time performance by utilizing cache allocation technology,” Apr. 2015, White Paper.
  • [37] H. Yun, G. Yao, R. Pellizzoni, M. Caccamo, and L. Sha, “Memory bandwidth management for efficient performance isolation in multi-core platforms,” IEEE Transactions on Computers, vol. 65, no. 2, pp. 562–576, Feb 2016.
  • [38] “perf(1) — linux manual page,” https://man7.org/linux/man-pages/man1/perf.1.html, accessed: 2023-09-29.
  • [39] A. W. Bowman, “An alternative method of cross-validation for the smoothing of density estimates,” Biometrika, vol. 71, no. 2, pp. 353–360, 1984.
  • [40] P. Hall, J. Marron, and B. U. Park, “Smoothed cross-validation,” Probability theory and related fields, vol. 92, no. 1, pp. 1–20, 1992.
  • [41] T. Eiter and H. Mannila, “Computing discrete Fréchet distance,” 1994. [Online]. Available: https://api.semanticscholar.org/CorpusID:16010565