Time-invariant prefix-free source coding for MIMO LQG control

Travis C. Cuvelier, Takashi Tanaka, and Robert W. Heath, Jr This work was supported in part by the National Science Foundation under Grant No. ECCS-1711702 and and CAREER Award #1944318. T. Cuvelier is with the Department of Electrical and Computer Engineering, The University of Texas at Austin, TX, 78712 USA e-mail: [email protected]. Tanaka is with the Department of Aerospace Engineering and Engineering Mechanics, The University of Texas at Austin, TX, 78712 USA e-mail: [email protected]. Heath is with the Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC, 27606 USA e-mail: [email protected].

Abstract

In this work we consider discrete-time multiple-input multiple-output (MIMO) linear-quadratic-Gaussian (LQG) control where the feedback consists of variable length binary codewords. To simplify the decoder architecture, we enforce a strict prefix constraint on the codewords. We develop a data compression architecture that provably achieves a near minimum time-average expected bitrate for a fixed constraint on the LQG performance. The architecture conforms to the strict prefix constraint and does not require time-varying lossless source coding, in contrast to the prior art.

I Introduction

In this work, we consider discrete-time MIMO LQG control where feedback occurs via variable-length packets (codewords) of bits that we assume are conveyed reliably (without error) from a sensor/encoder to a decoder/controller. For some constraint on LQG control performance, we seek to design a sensor/encoder and decoder/controller that minimize the time-averaged expected bitrate of the binary channel.

The motivation for this formulation is communication-efficient remote control over wireless communications. In particular, we imagine a scenario where a remote sensor platform measures some dynamical system and conveys the measurements to the controller over a noiseless binary channel. At the physical and medium access control layers, reliable control of autonomous agents over wireless has motivated the development of ultra low latency reliable communication (ULLRC). While progress has been made on the development ULLRC, the fundamental scarcity of physical layer resources motivates approaches that minimize the use of these resources. In contrast to the work on ULLRC, we aim higher on the protocol stack; namely we propose to develop control and data compression (source coding) strategies to achieve satisfactory control performance with minimal communication bitrate. The minimum average length, in bits, of the binary packets the sensor encodes are an effective surrogate for the minimum amount of physical layer resources (time/bandwidth/power) required to achieve satisfactory performance for a fixed reliability (cf. e.g. [1, Section 2.6]). We impose a prefix constraint on the codewords that ensures that other users sharing the same communication medium can identify the end of each transmitted message and thus identify the channel as free-to-use. In contrast to the prior art, this work imposes a time-invariant (TI) prefix constraint on the codewords, which simplifies the decoding architecture architecture; the end of codewords can be detected via comparison with a TI list [2].

In the prior literature, achievability results for the problem of interest follow from asymptotically bounding the output entropy of a quantizer [3][4][5]. While these bounds can nearly be achieved with zero-delay lossless source coding, they generally require that the lossless code be adapted, at every timestep, to the probability mass function of the quantizer output (cf. [6, Section IV.A]). This adds a great deal of computational complexity to the encoder and decoder.

In this work, we propose a data compression architecture for MIMO plants based on [4] and [6] that attains the desired LQG performance, satisfies a TI prefix constraint, and does not require a time-varying lossless source codec. Starting from the architecture of [4], we prove that the quantizer output has a limiting distribution. We propose to losslessly encode the quantizations with a fixed prefix-free code adapted the limiting distribution. We prove that under this modification, the codewords satisfy a TI prefix constraint without an increase in codeword length and without the complexity of time-varying lossless source coding. The architecture nearly achieves a known lower bound on the minimum expected bitrate. In particular, we use results from systems theory to extend the scalar TI achievability approach in [6] to the MIMO setting.

I-A Related Work

Our work follows from the architecture for LQG control with minimum bitrate prefix-free variable-length coding in [3]. For scalar systems, [3] derived a lower bound on the bitrate of a prefix-free source codec inserted into an LQG feedback channel in terms of Massey’s directed information (DI) [7]. This motivated a rate-distortion problem for the tradeoff between DI and LQG control performance. Using entropy-coded dithered quantization (ECDQ), [3] proved that the DI lower bound was nearly achievable. In ECDQ, uses a shared sequence of IID uniform random variables to whiten the error induced by quantization process. The quantization are then encoded via an entropy source code (e.g. a Shannon-Fano-Elias code). The rate-distortion formulation was generalized to MIMO plants in [8]. Likewise, the achievability approach was generalized in [4]. New analytical lower bounds (for MIMO systems) for the bitrate/control performance tradeoff were derived in [5]. It was further demonstrated that such bounds were nearly achievable without the use of a dither signal in the high communication rate/strict control cost regime [5]. The achievablilty approaches in [3], [4], and [5] implicitly assume that quantizations are encoded using a time-varying lossless source code [6]. More recently, [9] and [2] refined the lower bound analysis in [3], clarifying that the bound still applies even when the encoder and decoder share a dither signal.

The use of time-varying source codes complicates compression architectures; generally speaking, the codebooks used to encode quantizations must be updated at every timestep. Furthermore, in network settings, a source code’s prefix properties may be used by receivers to detect the end of transmissions. If a TI prefix code is used, the set of prefixes against which received codewords are compared is constant over time [2]. In the time-varying case, there are no such guarantees. In [6] it was shown that when a dither signal is available, there exists a TI quantization and coding scheme for scalar plants that nearly achieves the DI lower bound. In this work, we generalize this to the MIMO setting; namely we prove the existence of a quantizer and TI source codec for MIMO plants that nearly achieves the directed information lower bound [8]. We believe this to be the first such “time-invariant” achievability result for MIMO plants in the literature.

Work on fixed bitrate and event-driven communication strategies for control are also relevant to this work. In [10], numerical experiments demonstrated that Lloyd-Max style quantization could be used to attain a control performance competitive with known variable-rate upper bounds. However, the quantizer in [10] is time-varying, and many experimental realizations had time-average control costs that greatly exceeded the variable-length bounds [10]. More recently, [11] have proposed the the mean time to quantizer saturation as a metric for the analysis of fixed-length coding in LQG feedback control. Periodic coding strategies were devised, and their performances and escaped times were analyzed theoretically and with an experiment [11]. More recently [12] considered tracking a scalar Markov source under an event-based communication paradigm. Under this formulation, an encoder monitoring a plant can send binary codewords (without prefix constraints) at arbitrary (continuous-time) instants to a synchronized decoder [12]. For a constraint on estimator error, the minimum bitrate communication policy was deduced. In this work, as in [4], [5], [10], we consider a sampled discrete-time control and communication model. While event-driven schemes like [12] may be able to achieve smaller bitrates than these works, the models for control and communication in [4], [5], [10] are more amenable to real-world sampled data systems with finite sensing and communication bandwidths.

I-B Notation

We denote constant scalars and vectors in lowercase $x$ , scalar and vector random variables in boldface $\boldsymbol{{x}}$ , and matrices by capital letters $X$ . We let $[{x}]_{i}$ denote the $i^{\mathrm{th}}$ element of ${x}$ , $I_{m}$ denote the $\mathbb{R}^{m\times m}$ identity matrix, $0_{m\times m}$ the respective zero matrix. Let $\lVert X\rVert_{2}$ denote the largest singular value of $X$ . We write P(S)D for “symmetric positive (semi)definite”, and let $\mathbb{S}^{m}_{+}$ denote the set of $m\times m$ PSD matrices. We let $\succ$ , $\succeq$ denote the standard partial order on the PSD cone, e.g. if $A,B\in\mathbb{R}^{m}$ , we write $A\succ B$ if $A-B$ is PD, likewise $A\succeq B$ if $A-B$ is PSD. For time domain sequences, let $\{\boldsymbol{{x}}_{t}\}$ denote $(\boldsymbol{{x}}_{0},\boldsymbol{{x}}_{1},\dots)$ , $\boldsymbol{{x}}_{a}^{b}$ denote $(\boldsymbol{{x}}_{a},\dots,\boldsymbol{{x}}_{b})$ if $b\geq a$ , $\boldsymbol{{x}}_{a}^{b}=\emptyset$ otherwise. Let $\boldsymbol{{x}}^{b}=\boldsymbol{{x}}_{0}^{b}$ . Denote the set of finite-length binary strings by $\{0,1\}^{*}$ . For $a\in\{0,1\}^{*}$ , let $\ell(a)$ be the length of $a$ in bits. Let $\Delta\mathbb{Z}^{m}$ denote the set of $m-$ tuples whose elements are integer multiples of $\Delta$ . Define the set ${\mathcal{B}}^{m}(\Delta)=\{{x}\in\mathbb{R}^{m}:\lVert{x}\rVert_{\infty}\leq\frac{\Delta}{2}\}$ . For a set $\mathcal{S}$ , define the indicator function of ${x}\in\mathcal{S}$ as $1_{{x}\in\mathcal{S}}$ . For a topological space $\mathbb{T}$ , let $\mathbb{B}(\mathbb{T})$ denote the standard Borel $\sigma$ -algebra of $\mathbb{T}$ . For Euclidean spaces, let $\lambda$ denote the Lebesgue measure. We use PMF for “probability mass function”, PDF for “probability density function” (with respect to $\lambda$ ), $\boldsymbol{{a}}\perp\boldsymbol{{b}}$ for “ $\boldsymbol{{a}}$ and $\boldsymbol{{b}}$ are independent”, and $H$ for entropy.

II System Model and Problem Formulation

We consider the system model depicted in Fig. 1. The system to be controlled is a TI, multidimensional, linear dynamical system (i.e., a MIMO plant) plant controlled via a feedback model where communication occurs over an ideal (delay and error free) binary channel. The plant is fully observable to an encoder/sensor block, which conveys a variable-length binary codeword $\boldsymbol{{a}}_{t}\in\{0,1\}^{*}$ over the channel to a combined decoder/controller. Upon receipt of the codeword, the decoder/controller designs the control input. Denote the state vector $\boldsymbol{{x}}_{t}\in\mathbb{R}^{m}$ , the control input $\boldsymbol{{u}}_{t}\in\mathbb{R}^{u}$ , and let $\boldsymbol{{w}}_{t}\sim\mathcal{N}({0},{W})$ denote processes noise assumed to be IID over time. We assume ${W}\succ{0}_{m\times m}$ , i.e., the process noise covariance is full rank. We assume assume that $\boldsymbol{{x}}_{0}\sim\mathcal{N}({0},{X}_{0})$ for some $X_{0}\succeq 0$ . For ${A}\in\mathbb{R}^{m\times m}$ the system matrix and ${B}\in\mathbb{R}^{m\times u}$ the feedback gain matrix, for $t\geq 0$ the plant dynamics are given by $\boldsymbol{{x}}_{t+1}={A}\boldsymbol{{x}}_{t}+{B}\boldsymbol{{u}}_{t}+\boldsymbol{{w}}_{t}$ . To ensure finite control cost is attainable, we assume $({A},{B})$ are stabilizable. We assume that the encoder and decoder share access to a random uniform dither signal $\{\boldsymbol{{d}}_{t}\}$ . We assume that, for some $\Delta>0$ to be specified, the random vectors $\boldsymbol{{d}}_{t}\in\mathbb{R}^{m}$ have components that are independently uniformly distributed on $[-\Delta/2,\Delta/2]$ , and that the sequence $\{\boldsymbol{{d}}_{t}\}$ is IID over time. We assume that $\{\boldsymbol{{w}}_{t}\}$ , $\{\boldsymbol{{d}}_{t}\}$ , and $\boldsymbol{{x}}_{0}$ are mutually independent. In real-world systems, this shared randomness can be effectively accomplished using synchronized pseudorandom number generators. We assume that the encoder/sensor and the decoder/controller may be randomized given their inputs. The encoder/sensor policy in Fig. 1 is a sequence of causally conditioned Borel measurable kernels denoted $\mathbb{P}_{\mathrm{E}}[\boldsymbol{{a}}_{0}^{\infty}||\boldsymbol{{d}}_{0}^{\infty},\boldsymbol{{x}}_{0}^{\infty}]=\left\{\mathbb{P}_{\mathrm{E}}[\boldsymbol{{a}}_{t}|\boldsymbol{{a}}_{0}^{t-1},\boldsymbol{{d}}_{0}^{t},\boldsymbol{{x}}_{0}^{t}]\right\}_{t}$ , and that corresponding decoder/controller policy is given by $\mathbb{P}_{\mathrm{C}}[\boldsymbol{{u}}^{\infty}||\boldsymbol{{a}}^{\infty},\boldsymbol{{d}}^{\infty}]=\left\{\mathbb{P}_{\mathrm{C}}[\boldsymbol{{u}}_{t}|\boldsymbol{{a}}^{t},\boldsymbol{{d}}^{t},\boldsymbol{{u}}^{t-1}]\right\}_{t}$ . Note that under the assumed dynamics, $\boldsymbol{{x}}^{t}$ is a deterministic function of $\boldsymbol{{x}}$ , $\boldsymbol{{a}}^{t-1}$ , $\boldsymbol{{u}}^{t-1}$ , and $\boldsymbol{{w}}^{t-1}$ . We encode conditional independence assumptions in the system model by a factorization of the one-step transition kernels for $\boldsymbol{{a}}_{t}$ , $\boldsymbol{{d}}_{t}$ , $\boldsymbol{{u}}_{t}$ , and $\boldsymbol{{w}}_{t}$ . For $t\geq 0$ , we assume the kernels factorize via

	$\mathbb{P}[\boldsymbol{{a}}_{t+1},\boldsymbol{{u}}_{t+1}\|\boldsymbol{{a}}^{t},\boldsymbol{{d}}^{t+1},\boldsymbol{{u}}^{t},\boldsymbol{{w}}^{t},\boldsymbol{{x}}_{0}]=\\ \mathbb{P}_{\mathrm{E}}[\boldsymbol{{a}}_{t+1}\|\boldsymbol{{a}}^{t},\boldsymbol{{d}}^{t+1},\boldsymbol{{x}}^{t+1}]\mathbb{P}_{\mathrm{C}}[\boldsymbol{{u}}_{t+1}\|\boldsymbol{{a}}^{t+1},\boldsymbol{{d}}^{t+1},\boldsymbol{{u}}^{t}],$		(1a)
	$\mathbb{P}[\boldsymbol{{a}}_{t+1},\boldsymbol{{d}}_{t+1},\boldsymbol{{u}}_{t+1},\boldsymbol{{w}}_{t+1}\|\boldsymbol{{a}}^{t},\boldsymbol{{d}}^{t},\boldsymbol{{u}}^{t},\boldsymbol{{w}}^{t},\boldsymbol{{x}}_{0}]=\\ \mathbb{P}[\boldsymbol{{a}}_{t+1},\boldsymbol{{u}}_{t+1}\|\boldsymbol{{a}}^{t},\boldsymbol{{d}}^{t},\boldsymbol{{u}}^{t},\boldsymbol{{w}}^{t},\boldsymbol{{x}}_{0}]\mathbb{P}[\boldsymbol{{d}}_{t+1}]\mathbb{P}[\boldsymbol{{w}}_{t+1}],$		(1b)

and that we have initially $\mathbb{P}[\boldsymbol{{a}}_{0},\boldsymbol{{d}}_{0},\boldsymbol{{u}},\boldsymbol{{w}}_{0}|\boldsymbol{{x}}_{0}]$ $=$ $\mathbb{P}[\boldsymbol{{w}}_{0}]\mathbb{P}[\boldsymbol{{d}}_{0}]\mathbb{P}_{\mathrm{E}}[\boldsymbol{{a}}_{0}|\boldsymbol{{x}}_{0},\boldsymbol{{d}}_{0}]$ $\mathbb{P}_{\mathrm{C}}[\boldsymbol{{u}}_{0}|\boldsymbol{{a}}_{0},\boldsymbol{{d}}_{0}]$ . Implications of these factorization are discussed in the caption of Fig. 1.

Refer to caption — Figure 1: The encoder can select the codeword $\boldsymbol{{a}}_{t}$ randomly given “the information it knows at time $t$ ”. When $\boldsymbol{{a}}_{t}$ arrives at the decoder, the decoder can randomly generate the control $\boldsymbol{{u}}_{t}$ given $\boldsymbol{{a}}_{t}$ as well as its own prior knowledge. The encoder and decoder share access to $\boldsymbol{{d}}_{t}$ , which is IID and generated “independently” of all past system variables.

We require the codewords to conform to a TI prefix constraint, namely that for all $i,j$ and distinct $a_{1},a_{2}\in\{0,1\}^{*}$ with both $\mathbb{P}[\boldsymbol{{a}}_{i}=a_{1}]>0$ and $\mathbb{P}[\boldsymbol{{a}}_{j}=a_{2}]>0$ , $a_{1}$ is not a prefix of $a_{2}$ and vice-versa. Thus, not only must the support of the random variable $\boldsymbol{{a}}_{i}$ be a set of prefix-free binary codewords, we also require the union of the supports of the $\{\boldsymbol{{a}}_{i}\}$ be prefix-free. This is a stronger prefix constraint than was considered in [4]. The TI constraint may enable for more computationally efficient communication resource sharing; the end of a codeword can be unambiguously identified via comparing received messages against the TI union of supports. We are interested in the following optimization, over policies $\mathbb{P}_{\mathrm{E}},\mathbb{P}_{\mathrm{C}}$ that conform to the prefix constraint:

\mathcal{L}(\gamma)=\left\{\begin{aligned} &\underset{\mathbb{P}_{\mathrm{E}},\mathbb{P}_{\mathrm{C}}}{\inf}\text{ }\frac{1}{T}\sum\nolimits_{t=0}^{T-1}\mathbb{E}[\ell(\boldsymbol{{a}}_{t})]\\ &\text{s.t. }\frac{1}{T}\sum\nolimits_{t=0}^{T-1}\mathbb{E}[\lVert\boldsymbol{{x}}_{t+1}\rVert_{{Q}}^{2}+\lVert\boldsymbol{{u}}_{t}\rVert_{{\Phi}}^{2}]\leq\gamma,\end{aligned}\right.

(2)

where ${Q}\succeq{0}$ , ${\Phi}\succ{0}$ , and $\gamma$ is the maximum tolerable LQG cost. Let $S$ be a stabilizing solution to the discrete algebraic Riccati equation (DARE) $A^{\mathrm{T}}SA-S-A^{\mathrm{T}}SB(B^{\mathrm{T}}SB+\Phi)^{-1}B^{\mathrm{T}}SA+Q=0$ , $K=-(B^{\mathrm{T}}SB+\Phi)^{-1}B^{\mathrm{T}}SA$ , and $\Theta=K^{\mathrm{T}}(B^{\mathrm{T}}SB+\Phi)K$ . Consider the optimization

\displaystyle\mathcal{R}(\gamma)

\displaystyle=

\displaystyle\left\{\begin{aligned} &\underset{\begin{subarray}{c}P,\Pi,\in\mathbb{R}^{m\times m}\\ P,\Pi\succeq 0\end{subarray}}{\inf}\frac{1}{2}(-\log_{2}{\det{\Pi}}+\log_{2}{\det{W}})\\ &\text{ }\text{s.t. }\mathrm{Tr}(\Theta P)+\mathrm{Tr}(WS)\leq\gamma\text{, }\\ &\text{ }P\preceq APA^{\mathrm{T}}+W\text{, }\\ &\text{ }\text{ }\text{ }\begin{bmatrix}P-\Pi&PA^{\mathrm{T}}\\ AP&APA^{\mathrm{T}}+W\end{bmatrix}\succeq 0\end{aligned}\right.

(3)

A consequence of [2] (cf. [6] and [8]) is the lower bound $\mathcal{R}(\gamma)\leq\mathcal{L}(\gamma)$ . In the sequel, we will analyze the bitrate of the proposed achievability scheme in terms of $\mathcal{R}(\gamma)$ .

III Achievability Architecture

Variable	Description/Key Equations
$\hat{P}\in\mathbb{S}_{+}^{m}$	Minimizing $P$ of (3), asymptotic KF estimator error
$\hat{P}_{+}\in\mathbb{S}_{+}^{m}$	$\hat{P}_{+}=A\hat{P}A^{\mathrm{T}}+W$ , asymptotic KF prediction error
$C\in\mathbb{R}^{m\times m}$	Measurement matrix, $\hat{P}^{-1}-\hat{P}_{+}^{-1}=C^{\mathrm{T}}C\frac{12}{\Delta^{2}}$
$\Delta\in\mathbb{R}$	Quantizer/dither sensitivity, $\hat{P}^{-1}-\hat{P}_{+}^{-1}=C^{\mathrm{T}}C\frac{12}{\Delta^{2}}$
$K\in\mathbb{R}^{u\times m}$	Certainty equivalent feedback control gain
$\overline{\boldsymbol{{x}}}_{t\|t-1}\in\mathbb{R}^{m}$	TI KF prediction of $\mathbf{x}_{t}$
$\overline{\boldsymbol{{x}}}_{t\|t}\in\mathbb{R}^{m}$	TI KF estimate of $\mathbf{x}_{t}$ . $\mathbf{u}_{t}=K\overline{\boldsymbol{{x}}}_{t\|t}$
$\overline{\boldsymbol{{e}}}_{t}\in\mathbb{R}^{m}$	KF predict error $\overline{\boldsymbol{{e}}}_{t}=\overline{\boldsymbol{{x}}}_{t\|t-1}-\mathbf{x}_{t}$ , $\mathbb{E}[\overline{\boldsymbol{{e}}}_{t}\overline{\boldsymbol{{e}}}_{t}^{\mathrm{T}}]\rightarrow\hat{P}_{+}$
$\boldsymbol{{d}}_{t}\in\mathbb{R}^{m}$	Dither sequence, IID elements $\sim\text{Uniform}(\Delta)$
$\boldsymbol{{q}}_{t}\in\Delta\mathbb{Z}^{m}$	Quantization $\boldsymbol{{q}}_{t}=Q_{\Delta}(C\boldsymbol{{e}}_{t}+\boldsymbol{{d}}_{t})$ , encoded into $\boldsymbol{{a}}_{t}$
$\boldsymbol{{\tilde{q}}}_{t}\in\mathbb{R}^{m}$	Reconstruction $\boldsymbol{{\tilde{q}}}_{t}=\boldsymbol{{q}}_{t}-\boldsymbol{{d}}_{t}$ , error $\boldsymbol{{v}}_{t}=\boldsymbol{{\tilde{q}}}_{t}-C\mathbf{e}_{t}$
$\boldsymbol{{v}}_{t}\in\mathbb{R}^{m}$	$\boldsymbol{{v}}_{t}\perp\mathbf{e}_{t}$ , $\boldsymbol{{v}}_{t}\perp\mathbf{x}_{t}$ , $\mathbb{E}[\boldsymbol{{v}}_{t}\boldsymbol{{v}}_{t+k}^{\mathrm{T}}]=I_{m}\frac{\Delta^{2}}{12}\mathbb{1}_{k=0}$
$\boldsymbol{{y}}_{t}\in\mathbb{R}^{m}$	Effective measurement at decoder $\boldsymbol{{y}}_{t}=C\mathbf{x}_{t}+\mathbf{v}_{t}$

TABLE I:

The achievability approach we proposed is demonstrated in Fig. 2. It is almost identically to the time-varying MIMO achievability approach in [6, Section IV.B] with the exception that the lossless Shannon-Fano-Elias source codec used to encode the quantizations is TI. We first describe the signals in Fig. 2 before summarizing relevant results from [6] that simplify the analysis. Let $\hat{P}$ denote the minimizing $P$ from (3), let $\hat{P}_{+}=A\hat{P}A^{\mathrm{T}}+W$ , and let $\Delta\in\mathbb{R}$ , $\Delta>0$ , and $C\in\mathbb{R}^{m\times m}$ be such that $\hat{P}^{-1}-\hat{P}_{+}^{-1}=C^{\mathrm{T}}C\frac{12}{\Delta^{2}}$ . In this way, $\hat{P}$ ( $\hat{P}_{+}$ ) is interpreted as the asymptotic (prediction) error covariance attained by a Kalman filter tracking the source process $\{\boldsymbol{{x}}_{t}\}$ under a measurement model $\boldsymbol{{y}}_{t}=C\boldsymbol{{x}}_{t}+\boldsymbol{{v}}_{t}$ , where $\mathbb{E}[\boldsymbol{{v}}_{t}\boldsymbol{{v}}_{t+k}^{\mathrm{T}}]=I_{m}\frac{\Delta^{2}}{12}\mathbb{1}_{k=0}$ , $\mathbb{E}[\boldsymbol{{v}}_{t}\boldsymbol{{x}}_{0}^{\mathrm{T}}]=\mathbb{E}[\boldsymbol{{v}}_{t}\boldsymbol{{w}}_{t+k}^{\mathrm{T}}]=0_{m\times m}$ . We will show that this measurement model is attained by the decoder in Fig. 2 using dithered quantization.

Define a uniform elementwise quantizer via $Q_{\Delta}\colon\mathbb{R}^{m}\rightarrow\Delta\mathbb{Z}^{m}$ via $[Q_{\Delta}({x})]_{i}=k\Delta,\text{ if }[{x}]_{i}\in[k\Delta-\Delta/2,k\Delta+\Delta/2),$ e.g. each element of the vector ${x}$ is rounded to the nearest multiple of $\Delta$ . Both the encoder and decoder in Fig. 2 operate identical time invariant Kalman filters that compute recursive estimates of $\boldsymbol{{x}}_{t}$ given measurements received by the decoder. Let $\overline{\boldsymbol{{x}}}_{t|t-1}$ denote the prior estimate at time $t$ and $\boldsymbol{{\overline{x}}}_{t|t}$ denote the posterior. Set $\boldsymbol{{\overline{x}}}_{0|-1}=0$ . At every time $t$ the encoder and decoder produces the Kalman filter prediction error

\displaystyle\boldsymbol{{e}}_{t}=\boldsymbol{{x}}_{t}-\overline{\boldsymbol{{x}}}_{t|t-1},

(4)

and the linear measurement innovation $\boldsymbol{{e}}_{t}$ . It then produces the discrete, dithered quantization $q_{t}=Q_{\Delta}(C\boldsymbol{{e}}_{t}+\boldsymbol{{d}}_{t})$ . It encodes the discrete quantization losslessly into the codeword $\boldsymbol{{a}}_{t}$ which it transmits to the decoder. Upon receiving $\boldsymbol{{a}}_{t}$ , the decoder can recover $\boldsymbol{{q}}_{t}$ exactly. It then produces the reconstruction $\boldsymbol{{\tilde{q}}}_{t}=\boldsymbol{{q}}_{t}-\boldsymbol{{d}}_{t}$ . Let $\boldsymbol{{v}}_{t}=\boldsymbol{{\tilde{q}}}_{t}-C\boldsymbol{{e}}_{t}$ denote the reconstruction error. It then produces the centered measurement $\boldsymbol{{y}}_{t}=\boldsymbol{{\tilde{q}}}_{t}+C\boldsymbol{{x}}_{t|t-1}$ (equivalently $\boldsymbol{{y}}_{t}=C\boldsymbol{{x}}_{t}+\boldsymbol{{v}}_{t}$ ). By [4, Lemma 1], the elements $[\boldsymbol{{v}}_{t}]_{i}$ are mutually independent and uniform on $[-\Delta/2,\Delta/2]$ and furthermore $\boldsymbol{{v}}_{t}\perp\boldsymbol{{x}}_{t}$ . Denote the TI Kalman filter gain $J=\hat{P}_{+}{C}^{\mathrm{T}}({C}\hat{P}_{+}{C}^{\mathrm{T}}+I_{m\times m}\frac{\Delta^{2}}{12})^{-1}$ . The decoder updates its estimate via

\displaystyle\boldsymbol{{\overline{x}}}_{t|t}=\boldsymbol{{\overline{x}}}_{t|t-1}+J(\boldsymbol{{y}}_{t}-C\boldsymbol{{\overline{x}}}_{t|t-1}),

(5)

and applies the certainty equivalent control input $\boldsymbol{{u}}_{t}=K\boldsymbol{{\overline{x}}}_{t|t}$ . It then computes the predict estimate $\boldsymbol{{\overline{x}}}_{t|t-1}={A}\boldsymbol{{\overline{x}}}_{t-1|t-1}+{B}\boldsymbol{{u}}_{t-1}$ . Since the encoder knows the quantizations $\{\boldsymbol{{q}}_{t}\}$ and the dither $\{\boldsymbol{{d}}_{t}\}$ , it can recover the sequence of measurements $\{\boldsymbol{{y}}_{t}\}$ . The encoder can thus compute the sequence of TI Kalman filter estimates $\{\boldsymbol{{\overline{x}}}_{t|t-1},\boldsymbol{{\overline{x}}}_{t|t}\}$ . While we have yet to define how $\boldsymbol{{q}}_{t}$ is encoded into $\boldsymbol{{a}}_{t}$ , we have the following.

Lemma 1

So long as the decoder recovers $\boldsymbol{{q}}_{t}$ exactly at every $t$ , we have that 1) $\boldsymbol{{v}}_{t}\perp\boldsymbol{{e}}^{t},\boldsymbol{{v}}^{t-1}\boldsymbol{{w}}^{t},\boldsymbol{{x}}_{0}$ , 2) $\lim_{t\rightarrow\infty}H(\boldsymbol{{q}})_{t}\leq\mathcal{R}(\gamma)+m\left(1+\frac{1}{2}\log_{2}\left(\frac{2\pi e}{12}\right)\right)$ , 3) $\lim_{T\rightarrow\infty}\sum_{i=0}^{T}\mathbb{E}[\lVert\boldsymbol{{x}}_{t}\rVert_{2}^{2}+\lVert\boldsymbol{{u}}_{t}\rVert_{2}^{2}]\leq\gamma$ .

The first statement follows from [4, Lemma 1] and the system model and the subsequent are via [6, Theorem IV.5].

The TI prefix constraint can be satisfied by encoding each $\boldsymbol{{q}}_{t}$ with the same lossless prefix code at all time. Let $\boldsymbol{{q}}$ be a discrete random variable on $\Delta\mathbb{Z}^{m}$ (the same alphabet as $\boldsymbol{{q}}_{t}$ ). Assume the support of $\boldsymbol{{q}}$ is such that $\mathbb{P}[\boldsymbol{{q}}=k]=0$ only if $\mathbb{P}[\boldsymbol{{q}}_{t}=k]=0$ . Define $\overline{F}_{\boldsymbol{{q}}}(q)=\mathbb{P}[\boldsymbol{{q}}<q]+\frac{\mathbb{P}[\boldsymbol{{q}}=q]}{2}$ , and define the encoding function $C^{\mathrm{F}}_{\boldsymbol{{q}}}(q):\Delta\mathbb{Z}^{m}\rightarrow\{0,1\}^{*}$ such that $C^{\mathrm{F}}_{\boldsymbol{{q}}}(q)$ is the binary expansion of $\overline{F}_{\boldsymbol{{q}}}(q)$ truncated to $\lceil-\log_{2}(\mathbb{P}_{\boldsymbol{{q}}}[q])\rceil+1$ bits. This is a standard, lossless, prefix-free Shannon-Fano-Elias code for the random variable $\boldsymbol{{q}}$ [13]. If we use $C^{\mathrm{F}}_{\boldsymbol{{q}}}$ to encode $\boldsymbol{{q}}_{t}$ (e.g. $\boldsymbol{{a}}_{t}=C^{\mathrm{F}}_{\boldsymbol{{q}}}(\boldsymbol{{q}}_{t})$ ) the codeword length satisfies

\displaystyle\mathbb{E}[\ell(\boldsymbol{{a}}_{t})]\leq 2+H(\boldsymbol{{q}}_{t})+D_{\mathrm{KL}}(\boldsymbol{{q}}_{t}||\boldsymbol{{q}}).

(6)

Note that if for some $k$ , $\mathbb{P}[\boldsymbol{{q}}=k]=0$ while $\mathbb{P}[\boldsymbol{{q}}_{t}=k]>0$ , then $D_{\mathrm{KL}}(\boldsymbol{{q}}_{t}||\boldsymbol{{q}})=\infty$ . The main result of this paper is the following lemma, proved in Section IV.

Lemma 2

There exists a random variable $\boldsymbol{{q}}$ such that $D_{\mathrm{KL}}(\boldsymbol{{q}}_{t}||\boldsymbol{{q}})$ is finite for all $t$ and $\lim_{t\rightarrow\infty}D_{\mathrm{KL}}(\boldsymbol{{q}}_{t}||\boldsymbol{{q}})=0$ .

This lemma allows us to bound the bitrate for when a TI prefix-free code is used in Fig. 2.

Theorem 3

Let $\eta=\left(1+\frac{1}{2}\log_{2}\left(\frac{2\pi e}{12}\right)\right)$ . Setting $\boldsymbol{{a}}_{t}=C^{\mathrm{F}}_{\boldsymbol{{q}}}(\boldsymbol{{q}}_{t})$ for all $t$ ensures that the system in Fig. 2 attains a control cost less than $\gamma$ , satisfies the prefix constraint in Section II, and attains a communication cost that satisfies

\displaystyle\lim\sup_{T\rightarrow\infty}\frac{1}{T}\sum\limits_{i=0}^{T-1}\mathbb{E}[\ell(\boldsymbol{{a}}_{t})]

\displaystyle\leq

\displaystyle\mathcal{R}(\gamma)+2+m\eta

(7)

Proof:

$C^{\mathrm{F}}_{\boldsymbol{{q}}}$ is a prefix free code fixed over all $t$ . Furthermore, since $D_{\mathrm{KL}}(\boldsymbol{{q}}_{t}||\boldsymbol{{q}})$ is finite for all $t$ , it is lossless on the support of { $\boldsymbol{{q}}_{t}\}$ . The bound on codeword length (7) follows via applying (6) to each term on the left-hand side and then using the Cesáro mean (cf. Lemma 1). ∎

In Section IV, we use ergodic theory to prove Lemma 2.

IV Proof of Lemma 2

We proceed as in [6] by analyzing the long-term behavior of the Kalman innovation and dither signals, e.g. $(\boldsymbol{{e}}_{t},\boldsymbol{{d}}_{t})$ . At every $t$ , the quantization error $\boldsymbol{{v}}_{t}$ is a measurable function of $(\boldsymbol{{e}}_{t},\boldsymbol{{d}}_{t})$ given by $\boldsymbol{{v}}_{t}=\boldsymbol{{q}}_{t}-{C}\boldsymbol{{e}}_{t}$ , or, equivalently, $\boldsymbol{{v}}_{t}=Q_{\Delta}({C}\boldsymbol{{e}}_{t}+\boldsymbol{{d}}_{t})-\boldsymbol{{d}}_{t}-{C}\boldsymbol{{e}}_{t}$ . Let $L=AJ$ and ${R}={A}-{L}{C}$ . The following recursion for the $\boldsymbol{{e}}_{t}$ (cf. (4)) can be derived

	$\displaystyle\boldsymbol{{e}}_{t+1}$	$\displaystyle=$	$\displaystyle R\boldsymbol{{e}}_{t}-L\boldsymbol{{v}}_{t}+\boldsymbol{{w}}_{t}$		(8)
		$\displaystyle=$	$\displaystyle M(\boldsymbol{{e}}_{t},\boldsymbol{{d}}_{t})+\boldsymbol{{w}}_{t}.$		(9)

By construction of $L$ , $(A-LC)$ is stable, e.g. its eigenvalues lie strictly within the complex unit circle [14]. Define the state space $\mathbb{D}^{m}=\mathbb{R}^{m}\times\mathcal{B}^{m}(\Delta)$ . For ${r},{\mu}\in\mathbb{R}^{m}$ , $\Sigma\in\mathbb{S}_{+}^{m\times m}$ , denote the PDF of a multivariate Gaussian random variable with mean ${\mu}$ and covariance $\Sigma$ evaluated at ${r}$ by $N({r};{\mu},\Sigma)$ , i.e. define the function $N({r};{\mu},\Sigma):\mathbb{R}^{m}\times\mathbb{R}^{m}\times\mathbb{S}_{+}^{m}\rightarrow\mathbb{R}$ via $N({r};{\mu},\Sigma)=\frac{1}{\sqrt{(2\pi)^{m}\det{\Sigma}}}e^{-\frac{1}{2}({r}-{\mu})^{\mathrm{T}}\Sigma^{-1}({r}-{\mu})}$ . Define $M\colon(x,y)\in\mathbb{D}^{m}\rightarrow\mathbb{R}^{m}$ via $M({x},{y})=R{x}-L(Q_{\Delta}(C{x}+{y})-{y}-C{x})$ . By definition, the marginal PDF of the dither is $f_{\boldsymbol{{d}}_{t+1}}({d})=\frac{1}{\Delta^{m}}1_{{d}\in\mathcal{B}^{m}(\Delta)}$ for all $t$ . Since $\boldsymbol{{w}}_{t}\perp(\boldsymbol{{e}}^{t},\boldsymbol{{d}}^{t})$ , and $\boldsymbol{{d}}_{t+1}\perp(\boldsymbol{{e}}^{t+1},\boldsymbol{{w}}^{t}$ ), via (8) we have that $\{\boldsymbol{{e}}_{t},\boldsymbol{{d}}_{t}\}$ is a time-homogeneous first order Markov chain.

The PDFs of $\{\boldsymbol{{e}}_{t},\boldsymbol{{d}}_{t}\}$ factorize via $f_{\boldsymbol{{e}}_{t+1},\boldsymbol{{d}}_{t+1}|\boldsymbol{{e}}^{t},\boldsymbol{{d}}^{t}}=f_{\boldsymbol{{d}}_{t+1}}f_{\boldsymbol{{e}}_{t+1}|\boldsymbol{{e}}_{t},\boldsymbol{{d}}_{t}}$ . Via (8),

	$\displaystyle f_{\boldsymbol{{e}}_{t+1}\|\boldsymbol{{e}}_{t},\boldsymbol{{d}}_{t}}({e}\|{e}_{p},{d}_{p})$	$\displaystyle=$	$\displaystyle N({e};M({e}_{p},{d}_{p}),W)$		(10)
	$\displaystyle f_{\boldsymbol{{e}}_{t+1}\|\boldsymbol{{e}}_{t},\boldsymbol{{v}}_{t}}({e}\|{e}_{p},{v}_{p})$	$\displaystyle=$	$\displaystyle N({e};R{e}_{p}-L{v}_{p},W).$		(11)

Let $f_{t+1|t}=f_{\boldsymbol{{e}}_{t+1},\boldsymbol{{d}}_{t+1}|\boldsymbol{{e}}_{t},\boldsymbol{{d}}_{t}}$ , and likewise $f_{t+n|t}=f_{\boldsymbol{{e}}_{t+n},\boldsymbol{{d}}_{t+n}|\boldsymbol{{e}}_{t},\boldsymbol{{d}}_{t}}$ . From the foregoing, we have

\displaystyle f_{t+1|t}({e},{d}|{e}_{p},{d}_{p})=\frac{1_{{d}\in\mathcal{B}^{m}(\Delta)}N({e};M({e}_{p},{d}_{p}),W)}{\Delta^{m}}

(12)

Applying the Chapman-Komogorov equations to (12), it can be seen that the $n$ -step transition kernels satisfy

\displaystyle f_{t+n|t}({e},{d}|{e}_{p},{d}_{p})=f_{\boldsymbol{{e}}_{t+n}|\boldsymbol{{e}}_{t},\boldsymbol{{d}}_{t}}({e}|{e}_{p},{d}_{p})\frac{1_{{d}\in\mathcal{B}^{m}(\Delta)}}{\Delta^{m}}.

(13)

We now state the generalization of [6, Lemma A.3].

Lemma 4

Let $\Sigma_{n}=\sum\limits_{i=0}^{n-1}R^{i}W(R^{\mathrm{T}})^{i}$ and

\displaystyle{\mu}_{n}({e}_{0},{d}_{0},{v}_{1}^{n-1})=R^{n-1}M({e}_{0},{d}_{0})-\sum\limits_{i=0}^{n-2}R^{i}L{v}_{n-1-i}.

For all $n\geq 1$ we have

f_{\boldsymbol{{e}}_{n}|\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0},\boldsymbol{{v}}_{1}^{n-1}}({e}_{n}|{e}_{0},{d}_{0},{v}_{1}^{n-1})=\\ N({e}_{n};{\mu}_{n}({e}_{0},{d}_{0},{v}_{1}^{n-1}),\Sigma_{n}).

(14)

Proof:

The proof follows by induction on $n$ . By (10), (14) holds for $n=1$ . Assume that the relation (14) holds for some $n=k-1$ . To avoid clutter, when obvious we suppress the arguments to PMFs, writing, e.g. $f_{\boldsymbol{{e}}_{k}|\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0},\boldsymbol{{v}}_{1}^{k-1}}$ in place of $f_{\boldsymbol{{e}}_{k}|\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0},\boldsymbol{{v}}_{1}^{k-1}}({e}_{k}|{e}_{0},{d}_{0},{v}_{1}^{k-1})$ . Via Bayes’ Theorem, $f_{\boldsymbol{{e}}_{k}|\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0},\boldsymbol{{v}}_{1}^{k-1}}=\frac{f_{\boldsymbol{{e}}_{k},\boldsymbol{{v}}_{k-1}|\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0},\boldsymbol{{v}}_{1}^{k-2}}}{f_{\boldsymbol{{v}}_{k-1}|\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0},\boldsymbol{{v}}_{1}^{k-2}}}$ . Since $\boldsymbol{{v}}_{k-1}$ is a measurable function of $\boldsymbol{{e}}_{k-1}$ and $\boldsymbol{{d}}_{k-1}$ , $\boldsymbol{{v}}_{k-1}$ is conditionally independent of $(\boldsymbol{{v}}_{1}^{k-2},\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0})$ given $\boldsymbol{{e}}_{k-1}$ . By Lemma 1, $\boldsymbol{{v}}_{k-1}$ is (pairwise) independent of $\boldsymbol{{e}}_{k-1}$ . This further implies that $\boldsymbol{{v}}_{k-1}$ is independent of $(\boldsymbol{{v}}_{1}^{k-2},\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0})$ . Thus we have

	$\displaystyle f_{\boldsymbol{{v}}_{k-1}}$	$\displaystyle=$	$\displaystyle f_{\boldsymbol{{v}}_{k-1}\|\boldsymbol{{e}}_{k-1},\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0},\boldsymbol{{v}}_{1}^{k-2}}$		(15)
		$\displaystyle=$	$\displaystyle f_{\boldsymbol{{v}}_{k-1}\|\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0},\boldsymbol{{v}}_{1}^{k-2}}$		(16)

Using (15), it can be seen that $f_{\boldsymbol{{e}}_{k},\boldsymbol{{v}}_{k-1}|\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0},\boldsymbol{{v}}_{1}^{k-2}}=f_{\boldsymbol{{v}}_{k-1}}\int_{\mathbb{R}}f_{\boldsymbol{{e}}_{k}|\boldsymbol{{e}}_{k-1},\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0},\boldsymbol{{v}}_{1}^{k-1}}f_{\boldsymbol{{e}}_{k-1}|\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0},\boldsymbol{{v}}_{1}^{k-2}}d{e}_{k-1}$ . Thus,

f_{\boldsymbol{{e}}_{k}|\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0},\boldsymbol{{v}}_{1}^{k-1}}=\\ \int_{\mathbb{R}}f_{\boldsymbol{{e}}_{k}|\boldsymbol{{e}}_{k-1},\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0},\boldsymbol{{v}}_{1}^{k-1}}f_{\boldsymbol{{e}}_{k-1}|\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0},\boldsymbol{{v}}_{1}^{k-2}}d{e}_{k-1}.

(17)

As $\boldsymbol{{w}}_{k-1}\perp(\boldsymbol{{d}}_{0}^{k-1},\boldsymbol{{e}}_{0}^{k-1},\boldsymbol{{w}}_{0}^{t-1},\boldsymbol{{x}}_{0})$ , we have, via (8) that $\boldsymbol{{e}}_{k}$ is conditionally independent of $(\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0},\boldsymbol{{v}}_{0}^{k-2})$ given $(\boldsymbol{{e}}_{k-1},\boldsymbol{{v}}_{k-1})$ . Thus $f_{\boldsymbol{{e}}_{k}|\boldsymbol{{e}}_{k-1},\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0},\boldsymbol{{v}}_{1}^{k-1}}=f_{\boldsymbol{{e}}_{k}|\boldsymbol{{e}}_{k-1},\boldsymbol{{v}}_{k-1}}$ . Using (11) and assuming (14) holds for $n=k-1$ , the integration in (17) is given by $f_{\boldsymbol{{e}}_{k}|\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0},\boldsymbol{{v}}_{1}^{k-1}}=\int_{\mathbb{R}}N({e}_{k};R{e}_{k-1}-L{v}_{k-1},W)N({e}_{t-1};{\mu}_{k-1},\Sigma_{k-1})d{e}_{k-1}$ . This is the convolution of two multivariate Gaussian PDFs.

Computing the convolution, we have $f_{\boldsymbol{{e}}_{k}|\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0},\boldsymbol{{v}}_{1}^{k-1}}=N({e}_{k};R\mu_{k-1}-L{v}_{k-1},R\Sigma_{k-1}R^{\mathrm{T}}+W)$ , e.g. given $(\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0},\boldsymbol{{v}}_{1}^{k-1})=({e}_{0},{d}_{0},{v}_{1}^{k-1})$ , $\boldsymbol{{e}}_{k}$ is Gaussian with mean $R{\mu}_{k-1}-L{v}_{k-1}$ and covariance $R\Sigma_{k-1}R^{\mathrm{T}}+W$ . As $\mu_{k}=R\mu_{k-1}-L{v}_{k-1}$ and $\Sigma_{k}=R\Sigma_{k-1}R^{\mathrm{T}}+W$ we have the proof. ∎

The next lemmas describe properties of $\{\Sigma_{n}\}$ and the sequence of functions $\mu_{n}(e_{0},d_{0},v_{1}^{n-2}):\mathbb{R}^{m}\times(\mathcal{B}^{m}(\Delta))^{n-1}\rightarrow\mathbb{R}^{m}$ . We recall a result from system theory.

Lemma 5 (Gelfand’s Theorem and a Corollary [15])

Let $Q\in\mathbb{R}^{m\times m}$ and denote the spectral radius of $Q$ (e.g. the largest of the absolute values of $Q$ ’s eigenvalues) by $\rho_{\max}(Q)$ . Gelfand’s Theorem states that $\lim_{n\rightarrow\infty}\left(\lVert Q^{n}\rVert_{2}\right)^{\frac{1}{n}}=\rho_{\max}(Q)$ . Let $\gamma=(\rho_{\max}(Q)+1)/2$ . If $\rho_{\max}(Q)<1$ then $\gamma<1$ and by Gelfand’s Theorem $\exists$ $i$ such that $\forall$ $j\geq i$ $\lVert Q^{j}\rVert_{2}\leq\gamma^{j}$ .

Lemma 6

We have $\Sigma_{n}\succeq W\succ 0_{m\times m}$ . Furthermore, there exists a constant $c$ such that for all $n$ , $\lVert\Sigma_{n}\rVert_{2}\leq c$ .

Proof:

The first statement is immediate from the assumption that $W\succ 0_{m\times m}$ and the formula for $\Sigma_{n}$ in Lemma 4. Note that since $\Sigma_{n-1}\preceq\Sigma_{n}$ , we have that $\{\lVert\Sigma_{n}\rVert_{2}\}$ is monotonically increasing. Recall that $\rho_{\max}(R)<1$ [14]. Using Lemma 5, it is seen that for some $c<\infty$ , $\lim_{r\rightarrow\infty}\sum\limits_{i=0}^{r}\lVert R^{i}\rVert_{2}^{2}\lVert W\rVert_{2}=c$ . The triangle inequality and the submultiplicity of the matrix 2-norm gives $\lVert\Sigma_{n}\rVert_{2}\leq c$ . ∎

Lemma 7

There exist constants $\alpha$ and $\beta$ such that for any $n$ and choice of ${v}_{1}^{n-2}\in(\mathcal{B}^{m}(\Delta))^{n-2}$ we have $\lVert\mu_{n}\left({e}_{0},{d}_{0},{v}_{1}^{n-1}\right)\rVert_{2}\leq\alpha\lVert M(e_{0},d_{0})\rVert_{2}+\beta$ .

Proof:

The proof is analogous to that of Lemma 6. Lemma 5 is used to bound $\lVert R^{n-1}\rVert_{2}$ and $\sum_{i=0}^{n-2}\lVert R^{i}\rVert_{2}$ . We also use the fact that for ${v}_{j}\in\mathcal{B}^{m}(\Delta)$ , $\lVert{v}_{j}\rVert_{2}\leq\frac{\Delta\sqrt{m}}{2}$ . ∎

For $\alpha$ , $\beta$ , and $c$ as defined in Lemmas 6 and 7, define a subset of $\mathbb{R}^{m}\times\mathbb{R}^{m\times m}$ via $\mathcal{M}(e_{0},d_{0},m,L,R,\Delta)=\{({\mu},\Sigma)\in\mathbb{R}^{m}\times\mathbb{R}^{m\times m}:\lVert{\mu}\rVert_{2}\leq\alpha\lVert M(e_{0},d_{0})\rVert+\beta,\Sigma\succeq W,\lVert\Sigma\rVert_{2}\leq c\}$ . We have that $\mathcal{M}(e_{0},d_{0},m,L,R,\Delta)$ is compact (closed and bounded). This helps prove the main result of this section.

Lemma 8

There exists an invariant PDF $g_{\infty}$ such that

\displaystyle\int_{\mathbb{D}}f_{t+1|t}({e},{d}|{x},{y})g_{\infty}({x},{y})d{x}d{y}=g_{\infty}({e},{d}),

(18)

and $g_{\infty}({x},{y})>0$ for all $({x},{y})\in\mathbb{D}$ .

Proof:

A set $F\in\mathbb{B}(\mathbb{D})$ is called weakly transient with respect to the Markov kernel (12) if there exists a sequence of positive integers $n_{1}<n_{2}<\dots$ such that

\displaystyle\sum_{i=1}^{\infty}\mathbb{P}_{\boldsymbol{{e}}_{n_{i}},\boldsymbol{{d}}_{n_{i}}|\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0}}[F|\boldsymbol{{e}}_{0}={e}_{0},\boldsymbol{{d}}_{0}={d}_{0}]<\infty

(19)

holds for $\lambda$ almost-every $({e}_{0},{d}_{0})\in\mathbb{D}$ . A consequence of [16, Theorem 5] is that there exists an invariant PDF $g_{\infty}$ satisfying (18) and $g_{\infty}(a,b)>0$ for all $(a,b)\in\mathbb{D}$ if and only if every weakly transient set $F$ has $\lambda(F)=0$ . To prove the lemma, we proceed via contradiction. We prove that any set $F\in\mathbb{B}(\mathbb{D})$ with $\lambda(F)>0$ is not weakly transient. Let $F\subset\mathbb{D}$ have $\lambda(F)>0$ . Since $\lambda(F)>0$ , $F$ must contain an open ball, which in turn must contain a closed rectangle of positive Lebesgue measure, e.g. there exists a $\delta>0$ and point $({e}_{F},{d}_{F})\in F$ such that the set $H=\{(x,y)\in\mathbb{R}^{m}\times\mathbb{R}^{m}:\lVert x-{e}_{F}\rVert_{\infty}\leq\frac{\delta}{2},\lVert y-d_{F}\rVert_{\infty}\leq\frac{\delta}{2}\}$ has $H\subset F$ . Note $\lambda(H)=\delta^{2m}$ . Define the “e-section” of $H$ as the subset $H_{e}=\{x\in\mathbb{R}^{m}:\lVert x-{e}_{F}\rVert_{\infty}\leq\frac{\delta}{2}\}$ . As $H\subset F$ , the conditional probability satisfies

$\displaystyle\mathbb{P}_{\boldsymbol{{e}}_{n_{i}},\boldsymbol{{d}}_{n_{i}}\|\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0}}[F\|{e}_{0},{d}_{0}]$	$\displaystyle\geq$	$\displaystyle\int_{H}f_{t+n_{i}\|t}(x,y\|{e}_{0},{d}_{0})dxdy$
	$\displaystyle\geq$	$\displaystyle\delta^{2m}\inf_{({x},{y})\in H}f_{t+n_{i}\|t}({x},{y}\|{e}_{0},{d}_{0}),$
	$\displaystyle\geq$	$\displaystyle\frac{\delta^{2m}}{\Delta^{m}}\inf_{{x}\in H_{e}}f_{\boldsymbol{{e}}_{n_{i}}\|\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0}}({x}\|{e}_{0},{d}_{0}).$

Since the “minimum is less than the average”, we have

\inf_{{x}\in H_{e}}f_{\boldsymbol{{e}}_{n_{i}}|\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0}}({x}|{e}_{0},{d}_{0})\geq\\ \inf_{\begin{subarray}{c}{x}\in H_{e}\\ {v}_{1}^{n_{i}-1}\in(\mathcal{B}^{m}(\Delta))^{n-1}\end{subarray}}f_{\boldsymbol{{e}}_{n_{i}}|\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0},\boldsymbol{{v}}_{1}^{n_{i}-1}}({x}|{e}_{0},{d}_{0},{v}_{1}^{n_{i}-1}).

(21)

Using the explicit formula for $f_{\boldsymbol{{e}}_{n_{i}}|\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0},\boldsymbol{{v}}_{1}^{n_{i}-1}}$ given by (14) in Lemma 4, the right side of (21) can be written in terms of $\Sigma_{{n}_{i}}$ and $\mu_{n_{i}}(e_{0},d_{0},v_{1}^{{n}_{i}})$ . However, for a fixed $(e_{0},d_{0})$ , Lemma 7 proves a compact set that contains $\mu_{n_{i}}(e_{0},d_{0},v_{1}^{n_{i}})$ for any choice of $n_{i}$ and $v_{1}^{n_{i}}$ . This set is given by $\mathcal{C}_{{\mu}}=\{{\mu}\in\mathbb{R}^{m}:\lVert\mu\rVert_{2}\leq\alpha\lVert M(e_{0},d_{0})\rVert+\beta\}$ . Likewise, for any $n_{i}$ , $\Sigma_{n_{i}}$ falls within the compact set $\mathcal{C}_{\Sigma}=\{{\Sigma}\in\mathbb{R}^{m\times m}:\Sigma\succeq W,\lVert\Sigma\rVert_{2}\leq c\}$ Thus, we have, for any $n_{i}$

\inf_{\begin{subarray}{c}{x}\in H_{e}\\ {v}_{1}^{n_{i}-1}\in(\mathcal{B}^{m}(\Delta))^{n-1}\end{subarray}}f_{\boldsymbol{{e}}_{n_{i}}|\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0},\boldsymbol{{v}}_{1}^{n_{i}-1}}({x}|{e}_{0},{d}_{0},{v}_{1}^{n_{i}-1})\geq\\ \inf_{\begin{subarray}{c}{x}\in H_{e}\text{, }{\mu}\in\mathcal{C}_{{\mu}}\text{, }\Sigma\in\mathcal{C}_{\Sigma}\end{subarray}}N(x;{\mu},\Sigma).

(22)

Note that the optimization on the right of (22) does not depend on $n_{i}$ . At every point $(x,\mu,\Sigma)$ where $\Sigma\succ 0_{m\times m}$ , the positive function $N(x;\mu,\Sigma):\mathbb{R}^{m}\times\mathbb{R}^{m}\times\mathbb{R}^{m\times m}\rightarrow\mathbb{R}$ is continuous. Note that the subset $H_{e}\times\mathcal{C}_{\mu}\times\mathcal{C}_{\Sigma}\subset\mathbb{R}^{m}\times\mathbb{R}^{m}\times\mathbb{R}^{m\times m}$ is closed and bounded, thus compact, and that $\Sigma\in\mathcal{C}_{\Sigma}\rightarrow\Sigma\succ 0_{m\times m}$ . Thus we have that there exists $\epsilon>0$ such that $\inf_{\begin{subarray}{c}{x}\in H_{e},{\mu}\in\mathcal{C}_{{\mu}},\text{ }\Sigma\in\mathcal{C}_{\Sigma}\end{subarray}}N(x;{\mu},\Sigma)\geq\epsilon$ . Thus, following the inequalities from (IV) through (22) gives, for any $n_{i}$ that $\mathbb{P}_{\boldsymbol{{e}}_{n_{i}},\boldsymbol{{d}}_{n_{i}}|\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0}}[F|{e}_{0},{d}_{0}]>\epsilon$ . Thus for any $F$ with $\lambda(F)>0$ the terms in the summation (19) are bounded from below by $\epsilon$ . For any subsequence $\{n_{i}\}$ of $\mathbb{N}$ the summation in (19) diverges, which implies $F$ is not weakly transient. Thus, if $F$ is weakly transient, it must have $\lambda(F)=0$ , which via [16, Theorem 5] proves the lemma. ∎

The next ensures that the sequence $\{\boldsymbol{{e}}_{t},\boldsymbol{{d}}_{t}\}$ converges to $g_{\infty}$ .

Lemma 9

Let $(\boldsymbol{{e}}_{\infty},\boldsymbol{{d}}_{\infty})\sim g_{\infty}$ , and $(\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0})$ be drawn from a continuous distribution on $\mathbb{D}$ . The sequence of random variables $\{(\boldsymbol{{e}}_{t},\boldsymbol{{d}}_{t})\}$ converges weakly to $(\boldsymbol{{e}}_{\infty},\boldsymbol{{d}}_{\infty})$ .

Proof:

We can prove this result using [17, Theorem 4] adapted to the special case when a first-order time-homogeneous Markov chain admits an invariant PDF (18) (in other words. when the chain admits an invariant measure that is absolutely continuous with respect to Lebesgue measure). A Markov chain $\{\boldsymbol{{g}}_{t}\}$ on a state-space $\mathbb{G}$ is said to be $\phi$ -irreducible if there exists a nonzero $\sigma$ -finite measure $\phi$ such that for all measurable $\mathcal{G}\subset\mathbb{G}$ with $\phi(\mathcal{G})>0$ and all initial conditions $\boldsymbol{{z}}_{0}={z}_{0}$ with $z_{0}\in\mathbb{D}$ we can find an integer $n$ such that $\mathbb{P}_{\boldsymbol{{z}}_{n}|\boldsymbol{{z}}_{0}}[\boldsymbol{{z}}_{n}\in\mathcal{A}|\boldsymbol{{z}}_{0}={z}_{0}]>0$ . A Markov chain $\{\boldsymbol{{g}}_{t}\}$ on $\mathbb{G}$ is aperiodic if there does not exist $d\geq 1$ and disjoint nonempty measurable subsets $\mathcal{G}_{0},\mathcal{G}_{1},\dots\mathcal{G}_{d-1}$ such that when ${g}_{n-1}\in\mathcal{G}_{i}$ , then $\mathbb{P}_{\boldsymbol{{g}}_{n}|\boldsymbol{{g}}_{n-1}}[\boldsymbol{{g}}_{n}\in\mathcal{G}_{i+1\text{ mod }d}|\boldsymbol{{z}}_{n-1}={z}_{n-1}]=1$ . A consequence of [17, Theorem 4] is that if the Markov chain $\{\boldsymbol{{g}}_{t}\}$ admits an invariant PDF $g_{\infty}$ , is $\phi$ -irreducible, aperiodic, and $\boldsymbol{{g}}_{0}$ is a continuous random variable then the $\boldsymbol{{g}}_{i}$ converge weakly to the measure induced by the invariant PDF. Consider the Markov chain $\{\boldsymbol{{e}}_{t},\boldsymbol{{d}}_{t}\}$ on $\mathbb{D}$ , and recall that the Lebesgue measure $\lambda$ on $\mathbb{D}$ is countably generated. Take $\mathcal{A}\subset\mathbb{D}$ with $\lambda(\mathcal{A})>0$ , and let $({e}_{0},{d}_{0})\in\mathbb{D}$ be arbitrary. We have

\mathbb{P}_{\boldsymbol{{e}}_{1},\boldsymbol{{d}}_{1}|\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0}}[(\boldsymbol{{e}}_{1},\boldsymbol{{d}}_{1})\in\mathcal{A}|(\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0})=({e}_{0},{d}_{0})]=\\ \int_{\mathcal{A}}f_{t+1|t}({e},{d}|{e}_{0},{d}_{0})ddde.

(23)

It is immediate from the defition of $f_{t+1|t}$ in (12) and the fact that $\mathcal{A}\subset\mathbb{D}$ that $\mathbb{P}_{\boldsymbol{{e}}_{1},\boldsymbol{{d}}_{1}|\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0}}[(\boldsymbol{{e}}_{1},\boldsymbol{{d}}_{1})\in\mathcal{A}|(\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0})({e}_{0},{d}_{0})]\geq 0$ . Thus the Markov chain $\{\boldsymbol{{e}}_{t},\boldsymbol{{d}}_{t}\}$ is $\lambda-$ irreducible.

Via the same logic, it can be shown by contradiction that the chain is aperiodic. Assume that the chain is periodic, e.g. that for $d\geq 2$ , there exist disjoint nonempty measurable subsets $\mathcal{A}_{0},\mathcal{A}_{1},\dots,\mathcal{A}_{d-1}\subset\mathbb{D}$ satisfying such that if $({e}_{n-1},{d}_{n-1})\in\mathcal{A}_{i}$ then $\mathbb{P}_{\boldsymbol{{e}}_{n},\boldsymbol{{d}}_{n}|\boldsymbol{{e}}_{n-1},\boldsymbol{{d}}_{n-1}}[(\boldsymbol{{e}}_{n},\boldsymbol{{d}}_{n})\in\mathcal{A}_{i+1\text{ mod }d}|\boldsymbol{{z}}_{n-1}={z}_{n-1}]=1$ . Assume that $\mathcal{A}_{i}$ has $\lambda(\mathcal{A}_{i})>0$ (since the sets $\{\mathcal{A}_{k}\}$ partition $\mathbb{D}$ , at least one of the sets must have positive Lebesgue measure). Take $(e_{t},d_{t})\in\mathcal{A}_{i}$ , and let $j=i+1\text{ mod }d$ . By the assumption of periodicity

\mathbb{P}_{t+1|t}[(\boldsymbol{{e}}_{t+1},\boldsymbol{{d}}_{t+1})\in\mathcal{A}_{j}|(\boldsymbol{{e}}_{t},\boldsymbol{{d}}_{t})=({e}_{t},{d}_{t})]=1.

(24)

If $\mathcal{A}_{i}$ has $\lambda(\mathcal{A}_{i})>0$ , for any $(e_{t},d_{t})$ our proof of irreducibility guarantees that $\exists$ $\epsilon>0$ such that $\mathbb{P}[(\boldsymbol{{e}}_{t+1},\boldsymbol{{d}}_{t+1})\in\mathcal{A}_{i}|\boldsymbol{{e}}_{t}={e}_{t},\boldsymbol{{d}}_{t}={d}_{t}]\geq\epsilon$ . As we assumed $\mathcal{A}_{i}\cap\mathcal{A}_{j}=\emptyset$ , this contradicts (24). Thus the Markov chain $\{\boldsymbol{{e}}_{i},\boldsymbol{{d}}_{i}\}$ is aperiodic, and by [17, Theorem 4] we have the proof. ∎

The next lemma characterizes the invariant PDF $g_{\infty}$ .

Corollary 10

Assume $(\boldsymbol{{e}},\boldsymbol{{d}})\sim g_{\infty}$ . Denote the marginal PDF of $\boldsymbol{{e}}$ via $g_{\boldsymbol{{e}}_{\infty}}(e)=\int_{\mathcal{B}^{m}_{\Delta}}g_{\infty}(e,\delta)d\delta$ . We have that $\boldsymbol{{e}}\perp\boldsymbol{{d}}$ and $\boldsymbol{{d}}\sim\text{Uniform}[-\Delta/2,\Delta/2]$ . This implies that, $g_{\infty}\colon\mathbb{D}\rightarrow\mathbb{R}$ factorizes via $g_{\infty}(e,d)=g_{\boldsymbol{{e}}_{\infty}}(e)/\Delta^{m}$ for $(e,d)\in\mathbb{D}$ .

Proof:

Let $\mathcal{A}_{1}$ be an open ball in $\mathbb{R}^{m}$ and $\mathcal{A}_{2}$ be an open ball inside $\mathcal{B}^{m}(\Delta)$ . Open intervals like $\mathcal{A}_{1}\times\mathcal{A}_{2}$ form a $\pi$ system that generates the $\sigma-$ algebra $\mathbb{B}(\mathbb{D})$ . Using the definition of the transition kernel (12) and the invariant PDF (18), it can be shown that if $\mathcal{P}=\mathcal{A}_{1}\times\mathcal{A}_{2}$ then, $\mathbb{P}[(\boldsymbol{{e}},\boldsymbol{{d}})\in\mathcal{P}]=\int_{\mathcal{A_{1}}}g_{\boldsymbol{{e}}_{\infty}}(e)de\frac{\lambda(\mathcal{A}_{2})}{\Delta^{m}}$ . By Dynkin’s $\pi-\lambda$ theorem, this proves that $\boldsymbol{{e}}\perp\boldsymbol{{d}}$ (see e.g., [18, Prop. 2.13]). ∎

Let $(\boldsymbol{{e}},\boldsymbol{{d}})\sim g_{\infty}$ , and define $\boldsymbol{{q}}=Q_{\Delta}(C\boldsymbol{{e}}+\boldsymbol{{d}})$ . We now prove that $\boldsymbol{{q}}_{t}$ converges to $\boldsymbol{{q}}$ in the KL sense. This is not obvious a priori as $\boldsymbol{{q}}_{t}$ and $\boldsymbol{{q}}$ have countably infinite support.

Lemma 11

$D_{\mathrm{KL}}(\boldsymbol{{q}}_{t}||\boldsymbol{{q}})<\infty$ and $\underset{t\rightarrow\infty}{\lim}D_{\mathrm{KL}}(\boldsymbol{{q}}_{t}||\boldsymbol{{q}})=0$ .

Proof:

Using the properties of the invariant measure and a data processing inequality for f-divergences [19, Theorem 2.2 (6)] one can prove that the sequence of relative entropies is monotonically decreasing, i.e., that $D_{\mathrm{KL}}(\boldsymbol{{q}}_{t+1}||\boldsymbol{{q}})\leq D_{\mathrm{KL}}(\boldsymbol{{q}}_{t}||\boldsymbol{{q}})$ for all $t$ . Since $D_{\mathrm{KL}}(\boldsymbol{{q}}_{t}||\boldsymbol{{q}})\geq 0$ , the limit as $t\rightarrow\infty$ exists. From applying the data processing inequality for KL divergences several times (cf. [19, Theorem 2.2 (6) ]), and the fact that both $\boldsymbol{{d}}_{t}\perp\boldsymbol{{e}}_{t}$ and by Corollary 10 that both $\boldsymbol{{d}}\perp\boldsymbol{{e}}$ and $\boldsymbol{{d}}_{t}$ and $\boldsymbol{{d}}$ are identically distributed, we have that $D_{\mathrm{KL}}(\boldsymbol{{q}}_{t}||\boldsymbol{{q}})\leq D_{\mathrm{KL}}(\boldsymbol{{e}}_{t}||\boldsymbol{{e}})$ . Details can be found in the proof of [6, Lemma IV.10]. The rest of the proof follows from bounding $D_{\mathrm{KL}}(\boldsymbol{{e}}_{t}||\boldsymbol{{e}})$ .

Let $\{\boldsymbol{{\nu}}_{t}\}$ denote an IID sequence of random variables uniformly distributed on $\mathcal{B}^{m}(\Delta)$ . Likewise, let $\{\boldsymbol{{\omega}}_{t}\}$ be IID with $\boldsymbol{{\omega}}_{t}\sim\mathcal{N}(0,W)$ , and let $\boldsymbol{{\lambda}}\sim\mathcal{N}(0,X_{0})$ . Assume $\{\boldsymbol{{\omega}}_{t}\}$ , $\{\boldsymbol{{\nu}}_{t}\}$ , and $\boldsymbol{{\lambda}}$ are mutually independent. Let “ $\boldsymbol{{a}}\overset{D}{=}\boldsymbol{{b}}$ ” denote “ $\boldsymbol{{a}}$ and $\boldsymbol{{b}}$ are identically distributed”. By unwrapping the recursive definition of $\{\boldsymbol{{e}}_{t}\}$ in (8), we have $\boldsymbol{{e}}_{t}\overset{D}{=}R^{t}\boldsymbol{{\lambda}}+\sum_{i=0}^{t-1}R^{i}(\boldsymbol{{\omega}}_{i}-L\boldsymbol{{\nu}}_{i})$ . Likewise, by definition of $\boldsymbol{{e}}$ , we have that both $\boldsymbol{{e}}\overset{D}{=}\lim_{t\rightarrow\infty}R^{t}\boldsymbol{{\lambda}}+\sum_{i=0}^{t-1}R^{i}(\boldsymbol{{\omega}}_{i}-L\boldsymbol{{\nu}}_{i})$ and $\boldsymbol{{e}}\overset{D}{=}\lim_{t\rightarrow\infty}\sum_{i=0}^{t-1}R^{i}(\boldsymbol{{\omega}}_{i}-L\boldsymbol{{\nu}}_{i})$ following from the weak convergence guaranteed by Lemma 9. Define the random variables $\boldsymbol{{g}}_{\leq t}=\sum_{i=0}^{t-1}R^{i}\boldsymbol{{\omega}}_{i}$ , $\boldsymbol{{u}}_{\leq t}=-\sum_{i=0}^{t-1}R^{i}L\boldsymbol{{\nu}}_{i}$ , and $\boldsymbol{{s}}_{>t}=\lim_{r\rightarrow\infty}\sum_{i=t}^{r}R^{i}(\boldsymbol{{\omega}}_{i}-L\boldsymbol{{\nu}}_{i})$ . Note that the limit in the definition of $\boldsymbol{{s}}_{>t}$ is well defined via Lemma 5 in concert with Kolmogorov’s two-series theorem. By definition, $\boldsymbol{{e}}_{t}\overset{D}{=}R^{t}\boldsymbol{{\lambda}}+\boldsymbol{{g}}_{\leq t}+\boldsymbol{{u}}_{\leq t}$ and $\boldsymbol{{e}}\overset{D}{=}\boldsymbol{{g}}_{\leq t}+\boldsymbol{{u}}_{\leq t}+\boldsymbol{{s}}_{>t}$ . Note that $\boldsymbol{{g}}_{\leq t}\sim\mathcal{N}(0,\sum\limits_{i=0}^{t-1}RWR^{\mathrm{T}})$ . We have

	$\displaystyle D_{\mathrm{KL}}(\boldsymbol{{e}}_{t}\|\|\boldsymbol{{e}})$	$\displaystyle\leq$	$\displaystyle D_{\mathrm{KL}}(R^{t}\boldsymbol{{\lambda}}+\boldsymbol{{g}}_{\leq t}\|\|\boldsymbol{{g}}_{\leq t}+\boldsymbol{{s}}_{>t})$		(25)
		$\displaystyle\leq$	$\displaystyle D_{\mathrm{KL}}(R^{t}\boldsymbol{{\lambda}}+\boldsymbol{{g}}_{\leq t}\|\|\boldsymbol{{g}}_{\leq t}+\boldsymbol{{s}}_{>t}\|\boldsymbol{{s}}_{>t}),$		(26)

where (25) follows from the data processing inequality for f-divergences and (26) follows since conditioning increases KL divergence (see [19, Theorem 2.2 (5)]). Given $\boldsymbol{{s}}_{>t}=s$ , (26) simplifies to a KL divergence between multivariate Gaussians. Let $S_{t}=\sum_{i=0}^{t-1}RWR^{\mathrm{T}}$ and $\overline{S}_{t}=S_{t}+R^{t}X_{0}(R^{\mathrm{T}})^{t}$ . Since $\boldsymbol{{\lambda}}\perp\boldsymbol{{g}}_{\leq t}$ by construction, $R^{t}\boldsymbol{{\lambda}}+\boldsymbol{{g}}_{\leq t}\sim\mathcal{N}(0,\overline{S}_{t})$ . Recall also that $\boldsymbol{{g}}_{\leq t}\perp\boldsymbol{{s}}_{>t}$ by construction. We have, then that $D_{\mathrm{KL}}(R^{t}\boldsymbol{{\lambda}}+\boldsymbol{{g}}_{\leq t}||\boldsymbol{{g}}_{\leq t}+\boldsymbol{{s}}_{>t}|\boldsymbol{{s}}_{>t}=s)=D_{\mathrm{KL}}(\mathcal{N}(0,\overline{S}_{t})||\mathcal{N}(s,S_{t}))$ . Let $f_{t}:\mathbb{R}^{m}\rightarrow\mathbb{R}$ be defined via

\displaystyle f_{t}(s)=\text{Tr}(S_{t}^{-1}\overline{S}_{t})+{s}^{\mathrm{T}}S_{t}^{-1}{s}+\ln{\left(\det{S_{t}}/\det{\overline{S}_{t}}\right)}.

(27)

We have that $D_{\mathrm{KL}}(\mathcal{N}(0,\overline{S}_{t})||\mathcal{N}(s,S_{t}))=\frac{1}{2}(f_{t}(s)-m)$ , where the divergence is expressed in nats. Thus, via (26),

\displaystyle D_{\mathrm{KL}}(\boldsymbol{{q}}_{t}||\boldsymbol{{q}})

\displaystyle\leq

\displaystyle\frac{\mathbb{E}\left[(f_{t}(\boldsymbol{{s}}_{>t})-m)\right]}{2}.

(28)

Since $\boldsymbol{{s}}_{>t}\in\mathcal{L}^{2}$ and $S_{t},\overline{S}_{t}\succ W$ , it is clear that the right-hand side of (28) is finite for all $t$ . We now demonstrate that $\lim_{t\rightarrow\infty}\mathbb{E}\left[(f_{t}(\boldsymbol{{s}}_{>t})-m)\right]=0$ , attacking the terms in (27) one-at-a-time. As $\overline{S}_{t}\succeq{S}_{t}\succeq W\succ 0_{m\times m}$ , the determinants $\det{{S}_{t}}$ and $\det{\overline{S}_{t}}$ are bounded strictly away from $0$ . Since $\rho_{\max}(R)<1$ , we have that $Q=\lim_{t\rightarrow\infty}S_{t}$ is well defined, and that $\lim_{t\rightarrow\infty}\overline{S}_{t}=Q$ . Thus

\displaystyle\lim_{t\rightarrow\infty}\ln\left(\det{S_{t}}/\det{\overline{S}_{t}}\right)=0\text{ and }\lim_{t\rightarrow\infty}\text{Tr}(S_{t}^{-1}\overline{S}_{t})=m.

We now prove $\lim_{t\rightarrow\infty}\mathbb{E}[\boldsymbol{{s}}^{\mathrm{T}}S_{t}^{-1}\boldsymbol{{s}}]=0$ . Fix $t$ and let $\boldsymbol{{p}}_{t;r}=\sum_{i=t}^{r}R^{i}(\boldsymbol{{\omega}}_{i}-L\boldsymbol{{\nu}}_{i})$ . For any $t$ , $\lim_{r\rightarrow\infty}\boldsymbol{{p}}_{t;r}\boldsymbol{{p}}_{t;r}^{\mathrm{T}}=\boldsymbol{{s}}_{>t}\boldsymbol{{s}}_{>t}^{\mathrm{T}}$ . These limits are well defined by Kolmogorov’s two-series theorm. Let $U=W+L\frac{\Delta^{2}}{12}L^{\mathrm{T}}$ . We have

	$\displaystyle\mathbb{E}[\boldsymbol{{s}}^{\mathrm{T}}_{>t}\overline{S}^{-1}_{t}\boldsymbol{{s}}_{>t}]$	$\displaystyle=$	$\displaystyle\mathbb{E}[\text{Tr}\left(\overline{S}^{-1}_{t}\boldsymbol{{s}}_{>t}\boldsymbol{{s}}^{\mathrm{T}}_{>t}\right)]$		(29)
		$\displaystyle\leq$	$\displaystyle\text{Tr}\left(\overline{S}^{-1}_{t}\underset{r\rightarrow\infty}{\lim\inf}\text{ }\mathbb{E}\left[\boldsymbol{{p}}_{t;r}\boldsymbol{{p}}_{t;r}^{\mathrm{T}}\right]\right).$		(30)

where (30) follows from Fatou’s lemma, and linearity. We have $\mathbb{E}\left[\boldsymbol{{p}}_{t;r}\boldsymbol{{p}}_{t;r}^{\mathrm{T}}\right]=\sum_{i=t}^{r}R^{i}U(R^{\mathrm{T}})^{i}$ , thus

\lim_{r\rightarrow\infty}\mathbb{E}\left[\boldsymbol{{p}}_{t;r}\boldsymbol{{p}}_{t;r}^{\mathrm{T}}\right]=R^{t}\left(\lim_{r\rightarrow\infty}\sum_{i=0}^{r}R^{i}U(R^{\mathrm{T}})^{i}\right)(R^{\mathrm{T}})^{t},

(31)

where the limit exists via Lemma 5. Let $N=\left(\lim_{r\rightarrow\infty}\sum_{i=0}^{r}R^{i}\left(W+L\frac{\Delta^{2}}{12}L^{\mathrm{T}}\right)(R^{\mathrm{T}})^{i}\right)$ . We have established that $\mathbb{E}[\boldsymbol{{s}}^{\mathrm{T}}_{>t}\overline{S}^{-1}_{t}\boldsymbol{{s}}_{>t}]\leq\text{Tr}(\overline{S}^{-1}_{t}R^{t}N(R^{\mathrm{T}})^{t})$ . Thus since $\rho_{\max}(R)<1$ taking the limit as $t\rightarrow\infty$ proves $\underset{t\rightarrow\infty}{\lim}\mathbb{E}[\boldsymbol{{s}}^{\mathrm{T}}_{>t}\overline{S}^{-1}_{t}\boldsymbol{{s}}_{>t}]=0$ . Thus, in conclusion $\lim_{t\rightarrow\infty}D(\boldsymbol{{q}}_{t}||\boldsymbol{{q}})\leq\lim_{t\rightarrow\infty}\frac{\mathbb{E}\left[(f_{t}(\boldsymbol{{s}}_{>t})-m)\right]}{2}\leq 0$ . ∎

V Conclusion

For $m$ dimensional plants, we have proven the existence of a time-invariant data compression architecture and controller that can achieves any feasible LQG control performance given a feedback bitrate within approximately $2+1.26m$ bits of a known lower bound. Using the “Shannon-type” source codes described in [6, Section IV.A], it can be shown that this overhead can be reduced to $1+1.26m$ bits. The difference between the lower bound and the rate achieved by the dither free, but time-varying, achievability architecture in [5] is logarithmic in the plant dimension (i.e. $\mathcal{O}(\log(m))$ ). This follows from the use of more sophisticated lattice quantizers [5]. An opportunity for future work is to demonstrate TI achievability without the use of dithering and with more sophisticated quantization.

While we have proved that the minimum bitrate can nearly be achieved with a time-invariant source codec, there still remains the problem of developing a practical implementation of this source coding scheme. Numerical experiments would lend credence to these theoretical results. While the time-varying achievability approaches in [4] [5] require the precise construction of codebooks in accordance with the PMF of the quantizer output, the long-term analyses of Section IV could be useful in proving bounds on the performance of adaptive compression schemes based on, for example, [20].

References

[1] H. Jung, A. Pedram, T. Cuvelier, and T. Tanaka, “Optimized data rate allocation for dynamic sensor fusion over resource constrained communication networks,” Int. J. Robust and Nonlinear Control, 2021.
[2] T. C. Cuvelier, T. Tanaka, and R. W. Heath, “A lower-bound for variable-length source coding in linear-quadratic-gaussian control with shared randomness,” IEEE Contr. Syst. Lett., pp. 1–1, 2022.
[3] E. I. Silva, M. S. Derpich, and J. Ostergaard, “A framework for control system design subject to average data-rate constraints,” IEEE Trans. Automat. Cont., vol. 56, no. 8, pp. 1886–1899, 2011.
[4] T. Tanaka, K. H. Johansson, T. Oechtering, H. Sandberg, and M. Skoglund, “Rate of prefix-free codes in LQG control systems,” in Proc. IEEE ISIT, 2016, pp. 2399–2403.
[5] V. Kostina and B. Hassibi, “Rate-cost tradeoffs in control,” IEEE Trans. Automat. Cont., vol. 64, no. 11, pp. 4525–4540, 2019.
[6] T. Cuvelier, T. Tanaka, and R. Heath, “Prefix-free coding for LQG control,” arXiv:2204.00588, 2022.
[7] J. Massey, “Causality, feedback and directed information,” in Proc. IEEE ISIT, 1990, pp. 303–305.
[8] T. Tanaka, P. M. Esfahani, and S. K. Mitter, “LQG control with minimum directed information: Semidefinite programming approach,” IEEE Trans. Automat. Contr., vol. 63, no. 1, pp. 37–52, 2018.
[9] M. S. Derpich and J. Østergaard, “Comments on ‘A Framework for Control System Design Subject to Average Data-Rate Constraints’,” arXiv:2103.12897, 2021.
[10] A. Khina, Y. Nakahira, Y. Su, and B. Hassibi, “Algorithms for optimal control with fixed-rate feedback,” in Proc. IEEE CDC, 2017, pp. 6015–6020.
[11] B. Amini and R. R. Bitmead, “LQG control performance with low-bitrate periodic coding,” IEEE Control of Netw. Syst., vol. 9, no. 1, pp. 320–333, 2022.
[12] N. Guo and V. Kostina, “Optimal causal rate-constrained sampling for a class of continuous Markov processes,” IEEE Transactions on Information Theory, vol. 67, no. 12, pp. 7876–7890, 2021.
[13] T. M. Cover and J. A. Thomas, Elements of Information Theory. Wiley-Interscience, 1991.
[14] S. Chan, G. Goodwin, and K. Sin, “Convergence properties of the Riccati difference equation in optimal filtering of nonstabilizable systems,” IEEE Trans. Automat. Contr., vol. 29, no. 2, pp. 110–118, 1984.
[15] G. Dullerud and F. Paganini, A Course in Robust Control Theory. Springer, 2000.
[16] Y. Ito, “Invariant measures for markov processes,” Transactions of the American Mathematical Society, vol. 110, no. 1, pp. 152–184, 1964.
[17] G. O. Roberts and J. S. Rosenthal, “General state space Markov chains and MCMC algorithms,” Probability Surveys, vol. 1, no. none, pp. 20 – 71, 2004.
[18] G. Žitković, “Theory of probability (lecture notes),” Fall 2013. [Online]. Available: https://web.ma.utexas.edu/users/gordanz/lecture_notes_page.html
[19] Y. Polyanskiy and Y. Wu, “Lecture notes on information theory,” 2019. [Online]. Available: {http://people.lids.mit.edu/yp/homepage/data/itlectures_v5.pdf}
[20] D. Bontemps, “Universal coding on infinite alphabets: Exponentially decreasing envelopes,” IEEE Trans. Inf. Theory, vol. 57, no. 3, pp. 1466–1478, 2011.

$\displaystyle\mathbb{P}_{\boldsymbol{{e}}_{n_{i}},\boldsymbol{{d}}_{n_{i}}\|\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0}}[F\|{e}_{0},{d}_{0}]$	$\displaystyle\geq$	$\displaystyle\int_{H}f_{t+n_{i}\|t}(x,y\|{e}_{0},{d}_{0})dxdy$
	$\displaystyle\geq$	$\displaystyle\delta^{2m}\inf_{({x},{y})\in H}f_{t+n_{i}\|t}({x},{y}\|{e}_{0},{d}_{0}),$
	$\displaystyle\geq$	$\displaystyle\frac{\delta^{2m}}{\Delta^{m}}\inf_{{x}\in H_{e}}f_{\boldsymbol{{e}}_{n_{i}}\|\boldsymbol{{e}}_{0},\boldsymbol{{d}}_{0}}({x}\|{e}_{0},{d}_{0}).$