
Improved Channel Coding Performance Through Cost Variability

Adeel Mahmood and Aaron B. Wagner
School of Electrical and Computer Engineering, Cornell University
Abstract

Channel coding for discrete memoryless channels (DMCs) with mean and variance cost constraints has been recently introduced. We show that there is an improvement in coding performance due to cost variability, both with and without feedback. We demonstrate this improvement over the traditional almost-sure cost constraint (also called the peak-power constraint) that prohibits any cost variation above a fixed threshold. Our result simultaneously shows that feedback does not improve the second-order coding rate of simple-dispersion DMCs under the peak-power constraint. This finding parallels similar results for unconstrained simple-dispersion DMCs, additive white Gaussian noise (AWGN) channels and parallel Gaussian channels.

Index Terms:
Channel coding, feedback communications, second-order coding rate, stochastic control.

I Introduction

Channel coding is a fundamental problem focused on the reliable transmission of information over a noisy channel. Information transmission with arbitrarily small error probability is possible at all rates below the capacity $C$ of the channel, if the number $n$ of channel uses (also called the blocklength) is permitted to grow without bound [1]. At finite blocklengths, there is an unavoidable backoff from capacity due to the random nature of the channel. The second-order coding rate (SOCR) [2, 3, 4, 5, 6] quantifies the $O(n^{-1/2})$ convergence to capacity.

In many practical scenarios, the channel input is subject to cost constraints which limit the amount of resources that can be used for transmission. With a cost constraint present, the role of capacity is replaced by the capacity-cost function [1, Theorem 6.11]. One common form of cost constraint is the almost-sure (a.s.) cost constraint [3, 7], which bounds the time-average cost of the channel input $X^n$ over all messages, realizations of any side randomness, channel noise (if there is feedback), etc.:

\frac{1}{n}\sum_{i=1}^{n}c(X_{i}) \leq \Gamma \quad\text{almost surely,} \qquad (1)

where $c(\cdot)$ is the cost function. Under the almost-sure (a.s.) cost constraint, the optimal first-order coding rate is the capacity-cost function, the strong converse holds [1, Theorem 6.11], and the optimal SOCR is also known [3, Theorem 3].

The a.s. cost constraint is quite unforgiving, never allowing the cost to exceed the threshold under any circumstances. Our first result (Theorem 1) shows that the SOCR can be strictly improved by merely allowing the cost to fluctuate above the threshold in a manner consistent with a noise process, i.e., the fluctuations have a variance of $O(1/n)$. Our second result (Theorem 2) shows that the a.s. cost framework does not allow feedback to improve the SOCR of simple-dispersion DMCs (see Definition 1). This again contrasts with the scenario where random fluctuations with variance $O(1/n)$ above the threshold are allowed, as shown in [8, Theorem 3]. This highlights the important role cost variability plays in enabling feedback mechanisms to improve coding performance.

These findings raise the question of whether it is necessary to impose a constraint as stringent as (1). Cost constraints in communication systems are typically imposed to achieve goals such as operating circuitry in the linear regime, minimizing power consumption, and reducing interference with other terminals. It is worth noting that these goals do not always necessitate the use of the strict a.s. cost constraint. For example, the expected cost constraint is often used in wireless communication literature (see, e.g., [9, 10, 11, 12]) because it allows for a dynamic allocation of power based on the current channel state. The expected cost constraint bounds the cost averaged over time and the ensemble:

\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}c(X_{i})\right] \leq \Gamma. \qquad (2)

Yet, if the a.s. constraint is too strict, the expectation constraint is arguably too weak. The expectation constraint allows highly non-ergodic use of power, as shown in Section II-A, which is problematic both from the vantage points of operating circuitry in the linear regime and interference management.

The $O(1/n)$ variance allowance is a feature of a new cost formulation, referred to as mean and variance cost constraints in [8]. This formulation replaces (1) with the following conditions:

\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}c(X_{i})\right] \leq \Gamma, \qquad (3)
\text{Var}\left(\frac{1}{n}\sum_{i=1}^{n}c(X_{i})\right) \leq \frac{V}{n}. \qquad (4)

The mean and variance cost constraints were introduced as a relaxed version of the a.s. cost constraint that permits a small amount of stochastic fluctuation above the threshold $\Gamma$ while providing an ergodicity guarantee. Consider a random channel codebook whose codewords satisfy (3) with equality. For a given input $x^n$, define an ergodicity metric $\mathcal{E}_m$ as

\mathcal{E}_{m}(x^{n}) := \frac{\max\left(\frac{1}{n}\sum_{i=1}^{n}c(x_{i})-\Gamma,\,0\right)}{\Gamma}. \qquad (5)

The definition in (5) only penalizes cost variation above the threshold and normalizes by the mean cost $\Gamma$. Let $\alpha>0$ be the desired ergodicity parameter. We say that a transmission $x^n$ is $\alpha$-ergodic if $\mathcal{E}_m(x^n)\leq\alpha$. Let $\beta$ be the desired uncertainty parameter. We say that a random codebook is $(\alpha,\beta)$-ergodic if $\mathbb{P}\left(\mathcal{E}_m(X^n)\leq\alpha\right)\geq 1-\beta$, where $X^n$ is a random transmission from the codebook.
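To make these definitions concrete, here is a minimal Python sketch (our own illustration; the binary cost function $c(x)=x$ and the parameter values are assumptions chosen for this example) that evaluates the ergodicity metric $\mathcal{E}_m$ of (5) for a sample transmission and checks $\alpha$-ergodicity:

```python
import numpy as np

def ergodicity_metric(x, cost, Gamma):
    """E_m(x^n) of (5): cost overshoot above Gamma, normalized by Gamma."""
    avg_cost = np.mean([cost(xi) for xi in x])
    return max(avg_cost - Gamma, 0.0) / Gamma

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.22, size=1000)  # a transmission whose cost slightly overshoots
alpha, Gamma = 0.1, 0.2
m = ergodicity_metric(x, cost=lambda a: a, Gamma=Gamma)
print(f"E_m = {m:.4f}, alpha-ergodic: {m <= alpha}")
```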

Under the mean and variance cost formulation, we have $\mathbb{P}\left(\mathcal{E}_m(X^n)\leq\alpha\right)\geq 1-\beta$ if

n \geq n_{c} := \frac{V}{\beta\alpha^{2}\Gamma^{2}}, \qquad (6)

where we call $n_c$ the critical blocklength. Thus, the critical blocklength specifies the minimum blocklength of a channel code for which transmission behaves ergodically with high probability. For fixed $\alpha$, $\beta$, and $\Gamma$, the parameters $n_c$ and $V$ are in one-to-one correspondence, so one can view the choice of $V$ in (4) as specifying the critical blocklength (a numerical sketch of this relation is given after the list below). Note that with an expectation-only constraint, we effectively have $V=\infty$, so the transmission is not guaranteed to be ergodic at any blocklength. Furthermore, unlike the expected cost constraint, the mean and variance cost formulation:

  • allows for a strong converse [13, Theorem 77], [8],

  • allows for a finite second-order coding rate [8],

  • does not allow blasting power on errors in the feedback case [14].
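The following short computation (parameter values are illustrative assumptions) evaluates the critical blocklength $n_c$ of (6), together with the inverse relation $V=n\beta\alpha^2\Gamma^2$ used later in Section III-A:

```python
def critical_blocklength(V, alpha, beta, Gamma):
    """n_c = V / (beta * alpha^2 * Gamma^2) from (6)."""
    return V / (beta * alpha**2 * Gamma**2)

def variance_budget(n, alpha, beta, Gamma):
    """The inverse relation V = n * beta * alpha^2 * Gamma^2."""
    return n * beta * alpha**2 * Gamma**2

alpha, beta, Gamma = 0.1, 0.1, 0.2
for V in (1e-4, 1e-3, 1e-2):
    print(f"V = {V:.0e}  ->  n_c = {critical_blocklength(V, alpha, beta, Gamma):.1f}")
print(f"n = 1000  ->  V = {variance_budget(1000, alpha, beta, Gamma):.2e}")
```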

The results of this paper are also significant in the context of prior work. Our Theorem 2 extends the previously known result that feedback does not improve the second-order performance of simple-dispersion DMCs without cost constraints [4]. It also parallels the result in [15] that feedback does not improve the second-order performance of AWGN channels.

Random channel coding schemes often use independent and identically distributed (i.i.d.) codewords. It was noted in [16] that the a.s. cost constraint, which is the most commonly considered cost constraint in the context of discrete memoryless channels (DMCs), prohibits the use of i.i.d. codewords. It was shown in [16] that a feedback scheme that uses both i.i.d. and constant-composition codewords leads to an improved SOCR compared to the best non-feedback SOCR achievable under the a.s. cost constraint. Our result in Theorem 2 strengthens the result in [16] by showing that the aforementioned improvement also holds compared to the best feedback SOCR achievable under the a.s. cost constraint.

I-A Related Work

The second- and third-order asymptotics for DMCs with the a.s. cost constraint in the non-feedback setting have been characterized in [3] and [17], respectively. The second-order asymptotics in the feedback setting of DMCs that are not simple-dispersion are studied in [4] without cost constraints. There are more feedback results available for AWGN channels compared to DMCs under the a.s. cost constraint. For example, the result in [15] also addresses the third-order performance with feedback while [18] gives the result that feedback does not improve the second-order performance for parallel Gaussian channels. The second-order performance for the AWGN channel with an expected cost constraint is characterized in [19]. Table I summarizes these results across different settings in channel coding.

| Paper | Channel | Performance | Cost Constraint | Feedback | Non-feedback |
| --- | --- | --- | --- | --- | --- |
| Hayashi [3] | DMC, AWGN | 2nd order | a.s. | No | Yes |
| Tan and Tomamichel [20] | AWGN | 3rd order | a.s. | No | Yes |
| Kostina and Verdú [17] | DMC | 3rd order | a.s. | No | Yes |
| Fong and Tan [15] | AWGN | 2nd and 3rd order | a.s. | Yes | No |
| Wagner, Shende and Altuğ [4] | DMC | 2nd order | none | Yes | No |
| Mahmood and Wagner [8] | DMC | 2nd order | mean and variance | Yes | Yes |
| This paper | DMC | 2nd order | mean and variance, a.s. | Yes | Yes |
| Polyanskiy [13, Th. 78] | Parallel AWGN | 2nd order | a.s. | No | Yes |
| Fong and Tan [18] | Parallel AWGN | 2nd order | a.s. | Yes | No |
| Polyanskiy [13, Th. 77] | AWGN | 1st order | expected cost | No | Yes |
| Yang et al. [19] | AWGN | 2nd order | expected cost | No | Yes |

TABLE I: Relevant results across different settings in channel coding.

Our proof technique for Theorem 2 is more closely aligned with that used in [4] for DMCs than in [15] for AWGN channels. Both proofs show converse bounds with feedback that match the previously known non-feedback achievability results for DMCs and AWGN channels, respectively. A common technique used in both converse proofs is a result from binary hypothesis testing, which is used in the derivation of Lemma 1 in our paper and a similar result in [15, (17)]. We then proceed with the proof by using a Berry-Esseen-type result for bounded martingale difference sequences whereas [15] uses the usual Berry-Esseen theorem by first showing equality in distribution of the information density with a sum of i.i.d. random variables.

II Preliminaries

Let $\mathcal{A}$ and $\mathcal{B}$ be finite input and output alphabets, respectively, of the DMC $W$, where $W$ is a stochastic matrix from $\mathcal{A}$ to $\mathcal{B}$. For a given sequence $x^n\in\mathcal{A}^n$, the $n$-type $t=t(x^n)$ of $x^n$ is defined as

t(a) = \frac{1}{n}\sum_{i=1}^{n}\mathds{1}(x_{i}=a)

for all $a\in\mathcal{A}$, where $\mathds{1}(\cdot)$ is the standard indicator function. For a given sequence $x^n\in\mathcal{A}^n$, we will use $t(x^n)$ or $P_{x^n}$ to denote its type. Let $\mathcal{P}_n(\mathcal{A})$ be the set of $n$-types on $\mathcal{A}$. For a given $t\in\mathcal{P}_n(\mathcal{A})$, $T^n_{\mathcal{A}}(t)$ denotes the type class, i.e., the set of sequences $x^n\in\mathcal{A}^n$ with empirical distribution equal to $t$. For a random variable $Z$, $\|Z\|_\infty$ denotes its essential supremum (that is, the infimum of those numbers $z$ such that $\mathbb{P}(Z\leq z)=1$). We will write $\log$ to denote the logarithm to base $e$ and $\exp(x)$ to denote $e$ to the power of $x$. The cost function is denoted by $c(\cdot)$, where $c:\mathcal{A}\to[0,c_{\max}]$ and $c_{\max}>0$ is a constant. Let $\Gamma_0=\min_{a\in\mathcal{A}}c(a)$. Let $\Gamma^*$ denote the smallest $\Gamma$ such that the capacity-cost function $C(\Gamma)$ is equal to the unconstrained capacity. We assume $\Gamma^*>\Gamma_0$ and $\Gamma\in(\Gamma_0,\Gamma^*)$ throughout the paper. For $\Gamma\in(\Gamma_0,\Gamma^*)$, the capacity-cost function is defined as

C(\Gamma) = \max_{\substack{P\in\mathcal{P}(\mathcal{A}) \\ c(P)\leq\Gamma}} I(P,W), \qquad (7)

where $c(P):=\sum_{a\in\mathcal{A}}P(a)c(a)$. The function $C(\Gamma)$ is strictly increasing and differentiable [1, Problem 8.4] in the interval $(\Gamma_0,\Gamma^*)$. For a given $x^n\in\mathcal{A}^n$, we define

c(x^{n}) := \frac{1}{n}\sum_{i=1}^{n}c(x_{i}).
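As a concrete illustration of this notation (our own sketch, not part of the paper's development), the $n$-type $t(x^n)$ and the time-average cost $c(x^n)$ can be computed as follows:

```python
from collections import Counter

def n_type(x, alphabet):
    """Empirical distribution t(x^n) of the sequence x over the alphabet."""
    counts = Counter(x)
    n = len(x)
    return {a: counts.get(a, 0) / n for a in alphabet}

def avg_cost(x, cost):
    """Time-average cost c(x^n) = (1/n) * sum_i c(x_i)."""
    return sum(cost(xi) for xi in x) / len(x)

x = [0, 1, 1, 0, 1]
print(n_type(x, alphabet=[0, 1]))      # {0: 0.4, 1: 0.6}
print(avg_cost(x, cost=lambda a: a))   # 0.6
```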

Let $\Pi^*_{W,\Gamma}$ be the set of all capacity-cost-achieving distributions, i.e., the set of maximizing distributions in (7). For any $P^*\in\Pi^*_{W,\Gamma}$, let $Q^*=P^*W$ be the marginal distribution on $\mathcal{B}$. Note that the output distribution $Q^*$ is always unique, and without loss of generality, $Q^*$ can be assumed to satisfy $Q^*(b)>0$ for all $b\in\mathcal{B}$ [21, Corollaries 1 and 2 to Theorem 4.5.1].

The following definitions will remain in effect throughout the paper:

\nu_{a} := \text{Var}\left(\log\frac{W(Y|a)}{Q^{*}(Y)}\right), \quad\text{where } Y\sim W(\cdot|a),
\nu_{\max} := \max_{a\in\mathcal{A}}\nu_{a},
i(a,b) := \log\frac{W(b|a)}{Q^{*}(b)}.
Definition 1 (cf. [4])

A DMC $W$ is called simple-dispersion at the cost $\Gamma\in(\Gamma_0,\Gamma^*)$ if

\min_{P^{*}\in\Pi_{W,\Gamma}^{*}}\sum_{a\in\mathcal{A}}P^{*}(a)\nu_{a} = \max_{P^{*}\in\Pi_{W,\Gamma}^{*}}\sum_{a\in\mathcal{A}}P^{*}(a)\nu_{a}.

We will only focus on simple-dispersion channels for a fixed cost $\Gamma\in(\Gamma_0,\Gamma^*)$ and thus define

V(\Gamma) := \sum_{a\in\mathcal{A}}P^{*}(a)\nu_{a}

for any $P^*\in\Pi^*_{W,\Gamma}$.

With a blocklength $n$ and a fixed rate $R>0$, let $\mathcal{M}_R=\{1,\ldots,\lceil\exp(nR)\rceil\}$ denote the message set. Let $M\in\mathcal{M}_R$ denote the random message drawn uniformly from the message set.

Definition 2

An $(n,R)$ code for a DMC consists of an encoder $f$ which, for each message $m\in\mathcal{M}_R$, chooses an input $X^n=f(m)\in\mathcal{A}^n$, and a decoder $g$ which maps the output $Y^n$ to $\hat{m}\in\mathcal{M}_R$. The code $(f,g)$ is random if $f$ or $g$ is random.

Definition 3

An $(n,R)$ code with ideal feedback for a DMC consists of an encoder $f$ which, at each time instant $k$ ($1\leq k\leq n$) and for each message $m\in\mathcal{M}_R$, chooses an input $x_k=f(m,x^{k-1},y^{k-1})\in\mathcal{A}$, and a decoder $g$ which maps the output $y^n$ to $\hat{m}\in\mathcal{M}_R$. The code $(f,g)$ is random if $f$ or $g$ is random.

Definition 4

An $(n,R,\Gamma)$ code for a DMC is an $(n,R)$ code such that $c(X^n)\leq\Gamma$ almost surely, where the message $M\sim\text{Unif}(\mathcal{M}_R)$ has a uniform distribution over the message set $\mathcal{M}_R$.

Definition 5

An $(n,R,\Gamma)$ code with ideal feedback for a DMC is an $(n,R)$ code with ideal feedback such that $c(X^n)\leq\Gamma$ almost surely, where the message $M\sim\text{Unif}(\mathcal{M}_R)$ has a uniform distribution over the message set $\mathcal{M}_R$.

Definition 6

An $(n,R,\Gamma,V)$ code for a DMC is an $(n,R)$ code such that $\mathbb{E}\left[\sum_{i=1}^{n}c(X_i)\right]\leq n\Gamma$ and $\text{Var}\left(\sum_{i=1}^{n}c(X_i)\right)\leq nV$, where the message $M\sim\text{Unif}(\mathcal{M}_R)$ has a uniform distribution over the message set $\mathcal{M}_R$.

Definition 7

An $(n,R,\Gamma,V)$ code with ideal feedback for a DMC is an $(n,R)$ code with ideal feedback such that $\mathbb{E}\left[\sum_{i=1}^{n}c(X_i)\right]\leq n\Gamma$ and $\text{Var}\left(\sum_{i=1}^{n}c(X_i)\right)\leq nV$, where the message $M\sim\text{Unif}(\mathcal{M}_R)$ has a uniform distribution over the message set $\mathcal{M}_R$.

Given $\epsilon\in(0,1)$, define

M^{*}_{\text{fb}}(n,\epsilon,\Gamma) := \max\{\lceil\exp(nR)\rceil : \bar{P}_{\text{e,fb}}(n,R,\Gamma)\leq\epsilon\},

where $\bar{P}_{\text{e,fb}}(n,R,\Gamma)$ denotes the minimum average error probability attainable by any random $(n,R,\Gamma)$ code with feedback. Similarly, define

M^{*}(n,\epsilon,\Gamma) := \max\{\lceil\exp(nR)\rceil : \bar{P}_{\text{e}}(n,R,\Gamma)\leq\epsilon\},

where $\bar{P}_{\text{e}}(n,R,\Gamma)$ denotes the minimum average error probability attainable by any random $(n,R,\Gamma)$ code without feedback. Define $M^*_{\text{fb}}(n,\epsilon,\Gamma,V)$ and $M^*(n,\epsilon,\Gamma,V)$ similarly for codes with mean and variance cost constraints.

II-A Expectation-only cost constraint

Under this cost formulation, the average cost of the codewords is constrained in expectation only:

\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}c(X_{i})\right] \leq \Gamma. \qquad (8)

We now illustrate a codebook construction (adapted from [17]) with an average error probability at most $\epsilon\in(0,1)$ that meets the cost threshold $\Gamma$ according to (8), but whose codeword cost is non-ergodic, i.e., $\frac{1}{n}\sum_{i=1}^{n}c(X_i)$ does not converge to $\Gamma$. Consider a codebook $\mathcal{C}_n$ with rate $C(\Gamma)<R<C\left(\frac{\Gamma}{1-\epsilon}\right)$ whose average error probability $\epsilon_n\to 0$ and each of whose codewords has average cost equal to $\frac{\Gamma}{1-\epsilon}$. Such a codebook exists because $R<C\left(\frac{\Gamma}{1-\epsilon}\right)$. Assuming $\Gamma_0=\min_{a\in\mathcal{A}}c(a)=0$ without loss of generality, one could modify the codebook $\mathcal{C}_n$ by replacing an $\epsilon$-fraction of its codewords with the all-zero codeword. The modified codebook $\mathcal{C}_n'$ has average error probability at most $\epsilon_n'\to\epsilon$ and meets the cost threshold $\Gamma$ according to (8). But $\frac{1}{n}\sum_{i=1}^{n}c(X_i)$ is either $0$ or $\frac{\Gamma}{1-\epsilon}$. This construction also shows that the strong converse does not hold under the expected cost constraint.

The mean and variance cost constraints ensure that the average cost of the codewords concentrates around the cost threshold $\Gamma$, thereby disallowing codebook constructions with irregular or non-ergodic power consumption, as the following simulation sketch illustrates.
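The sketch below (illustrative values $\Gamma=0.2$ and $\epsilon=0.1$ assumed) mimics the cost profile of the modified codebook $\mathcal{C}_n'$: the expected cost meets the threshold, but the per-codeword time-average cost takes only the two values $0$ and $\Gamma/(1-\epsilon)$, so its variance does not decay with the blocklength:

```python
import numpy as np

rng = np.random.default_rng(1)
Gamma, eps = 0.2, 0.1
# Per-codeword time-average cost: 0 with prob. eps, Gamma/(1-eps) otherwise.
costs = rng.choice([0.0, Gamma / (1 - eps)], p=[eps, 1 - eps], size=100_000)
print(f"mean cost = {costs.mean():.4f}  (threshold Gamma = {Gamma})")
print(f"cost variance = {costs.var():.5f}  (constant in n, violating (4))")
```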

III Main Results

We prove the coding performance improvement in terms of the second-order coding rate, although equivalent results in terms of improvement in the average error probability can also be shown, as in [8, Theorems 1-3]. Let

  • $r_{\text{a.s.}}(\epsilon,\Gamma)$ denote the optimal SOCR with the a.s. cost constraint without feedback,

  • $r_{\text{a.s.,fb}}(\epsilon,\Gamma)$ denote the optimal SOCR with the a.s. cost constraint with feedback,

  • $r_{\text{m.v.}}(\epsilon,\Gamma,V)$ denote the optimal SOCR with the mean and variance cost constraints without feedback, and

  • $r_{\text{m.v.,fb}}(\epsilon,\Gamma,V)$ denote the optimal SOCR with the mean and variance cost constraints with feedback

for channel codes operating with average error probability at most $\epsilon\in(0,1)$. In the non-feedback case, $C(\Gamma)$ is the optimal first-order rate for DMCs with the a.s. cost constraint [1, Theorem 6.11] as well as the mean and variance cost formulation [8, Theorems 1 and 2], i.e.,

\lim_{n\to\infty}\frac{1}{n}\log M^{*}(n,\epsilon,\Gamma) = C(\Gamma) \qquad (9)
\lim_{n\to\infty}\frac{1}{n}\log M^{*}(n,\epsilon,\Gamma,V) = C(\Gamma) \qquad (10)

for all $\epsilon\in(0,1)$. The results (9) and (10) imply that the strong converse holds. We thus define the second-order rates with respect to the capacity-cost function $C(\Gamma)$ as follows:

r_{\text{a.s.}}(\epsilon,\Gamma) := \liminf_{n\to\infty}\frac{\log M^{*}(n,\epsilon,\Gamma)-nC(\Gamma)}{\sqrt{n}}
r_{\text{m.v.}}(\epsilon,\Gamma,V) := \liminf_{n\to\infty}\frac{\log M^{*}(n,\epsilon,\Gamma,V)-nC(\Gamma)}{\sqrt{n}}.

For the feedback case, we simply take the convention to define the SOCR with respect to $C(\Gamma)$ as follows:

r_{\text{a.s.,fb}}(\epsilon,\Gamma) := \liminf_{n\to\infty}\frac{\log M^{*}_{\text{fb}}(n,\epsilon,\Gamma)-nC(\Gamma)}{\sqrt{n}}
r_{\text{m.v.,fb}}(\epsilon,\Gamma,V) := \liminf_{n\to\infty}\frac{\log M^{*}_{\text{fb}}(n,\epsilon,\Gamma,V)-nC(\Gamma)}{\sqrt{n}}.

For the a.s. cost constraint, this convention is justified because, by Theorem 2, $C(\Gamma)$ is the optimal first-order rate, in the sense analogous to (9), for DMCs with feedback. For DMCs without cost constraints, Shannon [22, Theorem 6] showed that feedback does not increase the capacity.

III-A Performance improvement for non-feedback codes

From [3, Theorem 3], we have $r_{\text{a.s.}}(\epsilon,\Gamma)=\sqrt{V(\Gamma)}\,\Phi^{-1}(\epsilon)$ for a simple-dispersion DMC $W$ (the result in [3, Theorem 3] is not restricted to simple-dispersion DMCs). On the other hand, [8, Theorems 1 and 2] proved that

r_{\text{m.v.}}(\epsilon,\Gamma,V) = \max\left\{r\in\mathbb{R} : \mathcal{K}\left(\frac{r}{\sqrt{V(\Gamma)}},\frac{C'(\Gamma)^{2}V}{V(\Gamma)}\right)\leq\epsilon\right\} \qquad (11)

for a DMC $W$ such that $|\Pi^*_{W,\Gamma}|=1$ and $V(\Gamma)>0$, where the function $\mathcal{K}:\mathbb{R}\times(0,\infty)\to(0,1)$ is given by

\mathcal{K}(r,V) = \min_{\substack{\Pi:\; \mathbb{E}[\Pi]=r \\ \text{Var}(\Pi)\leq V \\ |\text{supp}(\Pi)|\leq 3}} \mathbb{E}\left[\Phi(\Pi)\right]. \qquad (12)

The maximum and the minimum in (11) and (12), respectively, are attained [8, Lemmas 3 and 4].
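Since (12) is a low-dimensional optimization, it can be explored numerically. The sketch below (our own illustration, assuming SciPy is available; the parameter values are arbitrary) upper-bounds $\mathcal{K}$ by searching over two-point distributions with the variance constraint met with equality, which in turn yields a lower bound on $r_{\text{m.v.}}$ of (11):

```python
import numpy as np
from scipy.stats import norm

def K_upper(r, V, grid=200):
    """Upper bound on K(r, V) of (12) via a search over two-point laws
    with mean r and variance exactly V (the point mass at r is also feasible)."""
    best = norm.cdf(r)
    for p in np.linspace(1e-4, 0.5, grid):
        pi = r + np.sqrt(V * (1 - p) / p)
        other = (r - p * pi) / (1 - p)      # enforces E[Pi] = r
        best = min(best, p * norm.cdf(pi) + (1 - p) * norm.cdf(other))
    return best

def r_mv_lower(eps, V_Gamma, CpV, r_grid):
    """Largest grid point r with K_upper <= eps; a lower bound on (11)."""
    feasible = [r for r in r_grid
                if K_upper(r / np.sqrt(V_Gamma), CpV / V_Gamma) <= eps]
    return max(feasible) if feasible else None

eps, V_Gamma, CpV = 0.01, 0.15, 0.05        # illustrative assumptions
r_grid = np.linspace(-2.0, 1.0, 301)
print(f"a.s. SOCR       : {np.sqrt(V_Gamma) * norm.ppf(eps):.4f}")
print(f"m.v. SOCR bound : {r_mv_lower(eps, V_Gamma, CpV, r_grid):.4f}")
```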

Theorem 1

Fix an arbitrary $\epsilon\in(0,1)$. Then for any $\Gamma\in(\Gamma_0,\Gamma^*)$, $V>0$ and a DMC $W$ such that $|\Pi^*_{W,\Gamma}|=1$ and $V(\Gamma)>0$, we have $r_{\text{m.v.}}(\epsilon,\Gamma,V)>r_{\text{a.s.}}(\epsilon,\Gamma)$.

Proof: The proof is given in Section IV.

The improvement in Theorem 1 is illustrated in Figs. 1 and 2, which plot the second-order coding rate as a function of the average error probability for a binary symmetric channel with parameter $p=0.3$, alphabets $\mathcal{A}=\mathcal{B}=\{0,1\}$, cost threshold $\Gamma=0.2$ and cost function $c(x)=x$.

Figure 1: The SOCR is compared between the almost-sure cost constraint and the mean and variance cost constraints for different values of $V$. The plots for the mean and variance cost constraints are lower bounds to the SOCR since they are obtained through a non-exhaustive search of the feasible region in the maximization and minimization in (11) and (12), respectively.

As discussed in (6), the choice of $V$ together with the desired values of $\alpha$ and $\beta$ specifies the critical blocklength, exceeding which guarantees the $(\alpha,\beta)$-ergodicity of the coding scheme. In practice, the choice of blocklength is more fundamental as it affects complexity and latency. Therefore, it is more prudent for the value of $V$ to be dictated by the blocklength $n$ and the desired $(\alpha,\beta)$-ergodicity via the relation $V=n\beta\alpha^2\Gamma^2$ derived from (6). With the same channel ($p=0.3$) and cost ($\Gamma=0.2$) parameters as used in Fig. 1, Fig. 2 shows the second-order performance for different critical blocklengths for an $(\alpha,\beta)$-ergodic codebook with $\alpha=\beta=0.1$.

Figure 2: The SOCR is compared between the almost-sure cost constraint and the mean and variance cost constraints for an $(\alpha,\beta)$-ergodic random codebook with $\alpha=\beta=0.1$.

III-B Performance improvement for feedback codes

Definition 8

A controller is a function $F:(\mathcal{A}\times\mathcal{B})^{*}\to\mathcal{P}(\mathcal{A})$.

Random feedback codes can be constructed by controllers. Given a message $m\in\mathcal{M}_R$ and the past channel inputs and outputs $(x^{k-1},y^{k-1})$, the channel input $X_k=f(m,x^{k-1},y^{k-1})$ at time instant $k$ is distributed according to $F(x^{k-1},y^{k-1})$. There is a one-to-one correspondence between a random feedback code and a controller-based code.

  • A random feedback code $(f,g)$ is equivalently specified by the joint distribution

    p_{M,X^{n},Y^{n},\hat{M}} = p_{M}\left(\prod_{i=1}^{n}p_{X_{i}|M,X^{i-1},Y^{i-1}}\,p_{Y_{i}|X_{i}}\right)p_{\hat{M}|Y^{n}},

    where $\hat{M}$ is the decoded message. Marginalizing out $M$ and $\hat{M}$, we obtain

    p_{X^{n},Y^{n}} = \prod_{i=1}^{n}p_{X_{i}|X^{i-1},Y^{i-1}}\,p_{Y_{i}|X_{i}}

    from which a controller $F$ can be obtained with the specification

    F(x_{k}|x^{k-1},y^{k-1}) = p_{X_{k}|X^{k-1},Y^{k-1}}(x_{k}|x^{k-1},y^{k-1})

    for each time $k$.

  • Likewise, a controller $F$ specifies a random feedback code by inducing the following joint distribution:

    p_{M,X^{n},Y^{n},\hat{M}}(m,x^{n},y^{n},\hat{m}) = p_{M}(m)\left(\prod_{i=1}^{n}F(x_{i}|x^{i-1},y^{i-1})\,p_{Y_{i}|X_{i}}(y_{i}|x_{i})\right)p_{\hat{M}|Y^{n}}(\hat{m}|y^{n}).

A controller $F$ along with the channel $W$ specifies a joint distribution over $\mathcal{A}^n\times\mathcal{B}^n$ with probability assignments given by

(F\circ W)(x^{n},y^{n}) = \prod_{k=1}^{n}F(x_{k}|x^{k-1},y^{k-1})\,W(y_{k}|x_{k}). \qquad (13)
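A minimal sketch of (13) (our own illustration; the repeat-last-output controller is an arbitrary toy policy, not one from the paper) computes the controller-induced joint probability and verifies that it defines a probability distribution on $\mathcal{A}^n\times\mathcal{B}^n$:

```python
import itertools

def joint_prob(F, W, xs, ys):
    """(F o W)(x^n, y^n) per (13); F maps (x^{k-1}, y^{k-1}) to a pmf on A."""
    p = 1.0
    for k in range(len(xs)):
        p *= F(xs[:k], ys[:k])[xs[k]] * W[xs[k]][ys[k]]
    return p

W = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}  # a BSC(0.1) as the DMC

def F(x_hist, y_hist):
    """Toy feedback controller: deterministically repeat the last output."""
    a = y_hist[-1] if y_hist else 0
    return {a: 1.0, 1 - a: 0.0}

n = 3
total = sum(joint_prob(F, W, xs, ys)
            for xs in itertools.product([0, 1], repeat=n)
            for ys in itertools.product([0, 1], repeat=n))
print(f"total probability = {total:.6f}")  # sums to 1
```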
Lemma 1

Consider a channel $W$ with cost constraint $\Gamma\in(\Gamma_0,\Gamma^*)$. Then for every $n$, $\rho>0$ and $\epsilon\in(0,1)$,

\log M^{*}_{\text{fb}}(n,\epsilon,\Gamma) \leq \log\rho - \log\left[\left(1-\epsilon-\sup_{F}\,\inf_{q\in\mathcal{P}(\mathcal{B}^{n})}(F\circ W)\left(\frac{W(Y^{n}|X^{n})}{q(Y^{n})}>\rho\right)\right)^{+}\right], \qquad (14)

where the supremum in (14) is over all controllers $F$ satisfying

(F\circ W)\left(c(X^{n})\leq\Gamma\right) = \sum_{x^{n}:\,c(x^{n})\leq\Gamma}\;\sum_{y^{n}\in\mathcal{B}^{n}}\;\prod_{k=1}^{n}F(x_{k}|x^{k-1},y^{k-1})\,W(y_{k}|x_{k}) = 1. \qquad (15)

Proof: Lemma 1 is similar to [23, Theorem 27], [18, (42)], [4, Lemma 15], [8, Lemma 2] and others. Its proof is omitted.

Theorem 2

Fix an arbitrary $\epsilon\in(0,1)$. Then for any $\Gamma\in(\Gamma_0,\Gamma^*)$ and a simple-dispersion DMC $W$ such that $V(\Gamma)>0$, we have $r_{\text{a.s.,fb}}(\epsilon,\Gamma)=r_{\text{a.s.}}(\epsilon,\Gamma)$.

Proof: The proof is given in Section V.

We prove only the converse part, namely the following upper bound:

r_{\text{a.s.,fb}}(\epsilon,\Gamma) \leq \sqrt{V(\Gamma)}\,\Phi^{-1}(\epsilon). \qquad (16)

The result of Theorem 2 follows by combining (16) with the existing achievability result (without feedback) from [3, Theorem 3].

From Theorems 1 and 2 alone, we have that $r_{\text{m.v.,fb}}(\epsilon,\Gamma,V)>r_{\text{a.s.,fb}}(\epsilon,\Gamma)$. More importantly, the mean and variance cost formulation admits feedback mechanisms that improve the SOCR, even if the capacity-cost-achieving distribution is unique, i.e., $|\Pi^*_{W,\Gamma}|=1$. This is the more interesting case since for compound-dispersion channels where $|\Pi^*_{W,\Gamma}|>1$, feedback is already known to improve second-order performance via timid/bold coding [4].

In contrast to the almost-sure constraint, where $r_{\text{a.s.,fb}}(\epsilon,\Gamma)=r_{\text{a.s.}}(\epsilon,\Gamma)$, we observe that $r_{\text{m.v.,fb}}(\epsilon,\Gamma,V)>r_{\text{m.v.}}(\epsilon,\Gamma,V)$ for simple-dispersion channels with $|\Pi^*_{W,\Gamma}|=1$ [8, Theorem 3]. In summary, for any $\epsilon\in(0,1)$, $\Gamma\in(\Gamma_0,\Gamma^*)$, $V>0$ and a simple-dispersion DMC $W$ such that $|\Pi^*_{W,\Gamma}|=1$ and $V(\Gamma)>0$,

r_{\text{m.v.,fb}}(\epsilon,\Gamma,V) > r_{\text{m.v.}}(\epsilon,\Gamma,V) > r_{\text{a.s.,fb}}(\epsilon,\Gamma) = r_{\text{a.s.}}(\epsilon,\Gamma),

where the last equality above has been proven without the assumption $|\Pi^*_{W,\Gamma}|=1$.

IV Proof of Theorem 1

Since $\mathcal{K}(r,V)$ is a continuous function [8, Lemma 3], it suffices to show that for all $r\in\mathbb{R}$ and $V>0$,

\min_{\substack{\Pi:\; \mathbb{E}[\Pi]=r \\ \text{Var}(\Pi)\leq V \\ |\text{supp}(\Pi)|\leq 2}} \mathbb{E}\left[\Phi(\Pi)\right] < \Phi(r). \qquad (17)

The LHS of (17) can be written as

\min_{\substack{p,\pi:\; 0\leq p\leq 1 \\ \frac{p}{1-p}(\pi-r)^{2}\leq V}} \left[p\,\Phi(\pi)+(1-p)\,\Phi\left(\frac{r-p\pi}{1-p}\right)\right],

where we used the constraint $\mathbb{E}[\Pi]=r$ to eliminate one of the decision variables.

Suppose, for the sake of contradiction, that

p\,\Phi(\pi)+(1-p)\,\Phi\left(\frac{r-p\pi}{1-p}\right) \geq \Phi(r) \qquad (18)

for all $\pi\geq r$, $p\in[0,1]$ and $\frac{p}{1-p}(\pi-r)^2\leq V$. The assumption $\pi\geq r$ is without loss of generality since one of the two point masses must be greater than or equal to $r$.

If (18) holds, then

p\,\Phi(\pi)+(1-p)\,\Phi\left(\frac{r-p\pi}{1-p}\right) \geq \Phi(r)

for all $\pi\geq r$, $p\in[0,1]$ and $\frac{p}{1-p}(\pi-r)^2=V$. Since $\pi=r+\sqrt{\frac{V(1-p)}{p}}$ in this case, we must have

p\,\Phi\left(r+\sqrt{\frac{V(1-p)}{p}}\right)+(1-p)\,\Phi\left(\frac{r}{1-p}-\frac{p}{1-p}\left(r+\sqrt{\frac{V(1-p)}{p}}\right)\right) \geq \Phi(r)
p\,\Phi\left(r+\sqrt{\frac{V(1-p)}{p}}\right)+(1-p)\,\Phi\left(r-\frac{p}{1-p}\sqrt{\frac{V(1-p)}{p}}\right) \geq \Phi(r)
p\,\Phi\left(r+\sqrt{\frac{V(1-p)}{p}}\right)+(1-p)\,\Phi\left(r-\sqrt{\frac{Vp}{1-p}}\right) \geq \Phi(r) \qquad (19)

for all $p\in[0,1]$.

Consider the function

f(p) = p\,\Phi\left(r+\sqrt{\frac{V(1-p)}{p}}\right)+(1-p)\,\Phi\left(r-\sqrt{\frac{Vp}{1-p}}\right)-\Phi(r)

with domain $p\in[0,1]$. For any $p\in(0,1)$,

\frac{f(p)-f(0)}{p} = \Phi\left(r+\sqrt{\frac{V(1-p)}{p}}\right)+\frac{1-p}{p}\,\Phi\left(r-\sqrt{\frac{Vp}{1-p}}\right)-\frac{\Phi(r)}{p}
\leq \Phi\left(r+\sqrt{\frac{V(1-p)}{p}}\right)+\frac{1}{p}\,\Phi\left(r-\sqrt{\frac{Vp}{1-p}}\right)-\frac{\Phi(r)}{p}
\stackrel{(a)}{=} \Phi\left(r+\sqrt{\frac{V(1-p)}{p}}\right)+\frac{1}{p}\left[\Phi(r)-\phi(\tilde{r})\sqrt{\frac{Vp}{1-p}}-\Phi(r)\right]
= \Phi\left(r+\sqrt{\frac{V(1-p)}{p}}\right)-\phi(\tilde{r})\sqrt{\frac{V}{p(1-p)}}. \qquad (20)

In equality $(a)$, we have $r-\sqrt{\frac{Vp}{1-p}}<\tilde{r}<r$ by the mean value theorem. For sufficiently small $p>0$, the expression in (20) is negative: the first term is at most $1$, while $\sqrt{V/(p(1-p))}\to\infty$ and $\phi(\tilde{r})\to\phi(r)>0$ as $p\to 0$. Since $f(0)=0$, we have $f(p)<0$ for some $p>0$, which contradicts (19).
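The contradiction can also be checked numerically. The sketch below (assuming SciPy; the values of $r$ and $V$ are arbitrary, since the argument holds for all $r\in\mathbb{R}$ and $V>0$) evaluates $f(p)$ and shows it dipping below zero for small $p$:

```python
import numpy as np
from scipy.stats import norm

def f(p, r, V):
    """f(p) from the proof of Theorem 1."""
    hi = r + np.sqrt(V * (1 - p) / p)
    lo = r - np.sqrt(V * p / (1 - p))
    return p * norm.cdf(hi) + (1 - p) * norm.cdf(lo) - norm.cdf(r)

r, V = 0.0, 1.0
for p in (1e-1, 1e-2, 1e-3, 1e-4):
    print(f"p = {p:.0e}:  f(p) = {f(p, r, V):+.6f}")  # negative for small p
```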

V Proof of Theorem 2

For any $t\in\mathcal{P}_n(\mathcal{A})$, define

d_{W}(t) := \inf_{P\in\Pi_{W,\Gamma}^{*}}\|t-P\|_{1}.

For any $0<\gamma\leq\frac{V(\Gamma)}{4|\mathcal{A}|\nu_{\max}}$, define

\mathcal{P}_{n}^{\gamma} = \left\{x^{n}\in\mathcal{A}^{n} : c(x^{n})\leq\Gamma \text{ and } d_{W}(t(x^{n}))>\gamma\right\}
\mathcal{P}_{n}^{\gamma,c} = \left\{x^{n}\in\mathcal{A}^{n} : c(x^{n})\leq\Gamma \text{ and } d_{W}(t(x^{n}))\leq\gamma\right\}. \qquad (21)
Definition 9

For any distribution $P\in\mathcal{P}(\mathcal{A})$ and $S\subset\mathcal{A}$ such that $P(S)>0$, define the probability measure

P|_{S}(x) = \begin{cases}\frac{P(x)}{P(S)} & x\in S\\ 0 & \text{otherwise}.\end{cases}
Definition 10

For any $k\geq 0$ and any $x^k\in\mathcal{A}^k$, let (here, for $k\geq n$, $\mathcal{A}_{x^k}=\emptyset$, and for $k=0$, $(x^k,x)=(x)$)

\mathcal{A}_{x^{k}} = \left\{x\in\mathcal{A} : (x^{k},x)\text{ is a prefix of some }x^{n}\in\mathcal{P}_{n}^{\gamma,c}\right\}.

Fix $(a_0,a_1,\ldots,a_{n-1})\in\mathcal{P}_n^{\gamma,c}$ arbitrarily and let $\mathds{1}_{a_i}\in\mathcal{P}(\mathcal{A})$ denote the point-mass distribution at $a_i$. Then for any $0\leq k\leq n-1$, $x^k\in\mathcal{A}^k$, $y^k\in\mathcal{B}^k$ and a controller $F$ satisfying (15), we define the controller $F_\gamma$ as

F_{\gamma}(x^{k},y^{k}) := \begin{cases}F(x^{k},y^{k})|_{\mathcal{A}_{x^{k}}} & \text{if }F(\mathcal{A}_{x^{k}}|x^{k},y^{k})>0\\ \text{Unif}(\mathcal{A}_{x^{k}}) & \text{if }F(\mathcal{A}_{x^{k}}|x^{k},y^{k})=0\text{ and }|\mathcal{A}_{x^{k}}|\neq 0\\ \mathds{1}_{a_{k}} & \text{otherwise}.\end{cases}
Remark 1

Given any controller $F$ satisfying (15), Definition 10 constructs a modified controller $F_\gamma$ which satisfies

(F_{\gamma}\circ W)\left(\mathcal{P}_{n}^{\gamma,c}\right) := \sum_{x^{n}\in\mathcal{P}_{n}^{\gamma,c}}\;\sum_{y^{n}\in\mathcal{B}^{n}}\;\prod_{k=1}^{n}F_{\gamma}(x_{k}|x^{k-1},y^{k-1})\,W(y_{k}|x_{k}) = 1. \qquad (22)

Intuitively, $F_\gamma$ amplifies the probability assignments of $F$ over the set $\mathcal{P}_n^{\gamma,c}$ and nullifies the probability assignments of $F$ over the set $\mathcal{P}_n^{\gamma}$, so that $X^n\in\mathcal{P}_n^{\gamma,c}$ almost surely for $(X^n,Y^n)\sim(F_\gamma\circ W)$. Definition 10 is inspired by, and corrects an error in, [4, Def. 8]. With the definition given in [4, Def. 8], the analogue of (22) does not hold, although it is asserted in the proof of [4, Thm. 3]. This can be rectified by using Definition 10 in place of [4, Def. 8]. An analogous comment applies to the next definition and [4, Def. 9].

Definition 11

For any type $t\in\mathcal{P}_n(\mathcal{A})$ such that $T^n_{\mathcal{A}}(t)\subset\mathcal{P}_n^{\gamma}$, $k\geq 0$ and any $x^k\in\mathcal{A}^k$, let

\mathcal{A}_{x^{k}}^{t} = \left\{x\in\mathcal{A} : (x^{k},x)\text{ is a prefix of some }x^{n}\in T^{n}_{\mathcal{A}}(t)\right\}.

Fix $(a_0,a_1,\ldots,a_{n-1})\in T^n_{\mathcal{A}}(t)$ arbitrarily and let $\mathds{1}_{a_i}\in\mathcal{P}(\mathcal{A})$ denote the point-mass distribution at $a_i$. Then for any $0\leq k\leq n-1$, $x^k\in\mathcal{A}^k$, $y^k\in\mathcal{B}^k$ and a controller $F$ satisfying (15), we define the controller $F_t$ as

F_{t}(x^{k},y^{k}) := \begin{cases}F(x^{k},y^{k})|_{\mathcal{A}_{x^{k}}^{t}} & \text{if }F(\mathcal{A}^{t}_{x^{k}}|x^{k},y^{k})>0\\ \text{Unif}(\mathcal{A}^{t}_{x^{k}}) & \text{if }F(\mathcal{A}^{t}_{x^{k}}|x^{k},y^{k})=0\text{ and }|\mathcal{A}^{t}_{x^{k}}|\neq 0\\ \mathds{1}_{a_{k}} & \text{otherwise}.\end{cases}
Remark 2

Given any controller $F$ satisfying (15), Definition 11 constructs a modified controller $F_t$ which satisfies $(F_t\circ W)\left(T^n_{\mathcal{A}}(t)\right)=1$ for $T^n_{\mathcal{A}}(t)\subset\mathcal{P}_n^{\gamma}$.

Now let

q(y^{n}) = \frac{1}{2}\prod_{i=1}^{n}Q^{*}(y_{i})+\frac{1}{2}\,\frac{1}{|\mathcal{P}_{n}(\mathcal{A})|}\sum_{t\in\mathcal{P}_{n}(\mathcal{A})}\prod_{i=1}^{n}q_{t}(y_{i}), \qquad (23)

where

q_{t}(b) := \sum_{a\in\mathcal{A}}t(a)\,W(b|a).
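A small sketch of the mixture (23) on a toy channel (our own illustration; the helper enumerating $n$-types is an assumption of the sketch, and the uniform $Q^*$ is the capacity-achieving output law of a BSC):

```python
import itertools
import numpy as np

def all_n_types(alphabet, n):
    """All n-types on the alphabet, as dicts a -> t(a)."""
    return [{a: c / n for a, c in zip(alphabet, counts)}
            for counts in itertools.product(range(n + 1), repeat=len(alphabet))
            if sum(counts) == n]

def q_mixture(yn, Qstar, W, n_types):
    """q(y^n) of (23): half the iid Q* law, half the average of iid q_t laws."""
    iid_Q = np.prod([Qstar[y] for y in yn])
    q_t = lambda t, b: sum(t[a] * W[a][b] for a in t)
    avg_t = np.mean([np.prod([q_t(t, y) for y in yn]) for t in n_types])
    return 0.5 * iid_Q + 0.5 * avg_t

W = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}
Qstar = {0: 0.5, 1: 0.5}
types = all_n_types([0, 1], n=3)
print(q_mixture((0, 1, 1), Qstar, W, types))
```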

Let $P$ denote the distribution $F\circ W$. Let $P_\gamma$ denote the distribution $F_\gamma\circ W$. Let $P_t$ denote the distribution $F_t\circ W$ for each $t\in\mathcal{P}_n(\mathcal{A})$ such that $T^n_{\mathcal{A}}(t)\subset\mathcal{P}_n^{\gamma}$. Note that all the controllers $F$, $F_\gamma$ and $F_t$ satisfy (15). We have

P\left(\frac{W(Y^{n}|X^{n})}{q(Y^{n})}>\rho\right)
= P\left(\frac{W(Y^{n}|X^{n})}{q(Y^{n})}>\rho \,\cap\, d_{W}(t(X^{n}))\leq\gamma\right)+P\left(\frac{W(Y^{n}|X^{n})}{q(Y^{n})}>\rho \,\cap\, d_{W}(t(X^{n}))>\gamma\right)
= P\left(\frac{W(Y^{n}|X^{n})}{q(Y^{n})}>\rho \,\cap\, d_{W}(t(X^{n}))\leq\gamma\right)+\sum_{t:T^{n}_{\mathcal{A}}(t)\subset\mathcal{P}_{n}^{\gamma}}P\left(\frac{W(Y^{n}|X^{n})}{q(Y^{n})}>\rho \,\cap\, t(X^{n})=t\right)
\stackrel{(a)}{\leq} P_{\gamma}\left(\frac{W(Y^{n}|X^{n})}{q(Y^{n})}>\rho\right)+\sum_{t:T^{n}_{\mathcal{A}}(t)\subset\mathcal{P}_{n}^{\gamma}}P_{t}\left(\frac{W(Y^{n}|X^{n})}{q(Y^{n})}>\rho\right). \qquad (24)

Inequality $(a)$ follows from the following argument. For any $(x^n,y^n)$ such that $x^n\in\mathcal{P}_n^{\gamma,c}$, note that for all $1\leq k\leq n$, $\mathcal{A}_{x^{k-1}}\neq\emptyset$ and $x_k\in\mathcal{A}_{x^{k-1}}$, so that

F(x_{k}|x^{k-1},y^{k-1}) \leq F(\mathcal{A}_{x^{k-1}}|x^{k-1},y^{k-1}). \qquad (25)

Then

P_{\gamma}\left((x^{n},y^{n})\right) = \prod_{k=1}^{n}F_{\gamma}(x_{k}|x^{k-1},y^{k-1})\,W(y_{k}|x_{k})
\stackrel{(b)}{\geq} \prod_{k=1}^{n}\frac{F(x_{k}|x^{k-1},y^{k-1})}{F(\mathcal{A}_{x^{k-1}}|x^{k-1},y^{k-1})}\,W(y_{k}|x_{k})
\geq \prod_{k=1}^{n}F(x_{k}|x^{k-1},y^{k-1})\,W(y_{k}|x_{k})
= P((x^{n},y^{n})).

With some abuse of notation, we assume in inequality $(b)$ above that if $F(\mathcal{A}_{x^{k-1}}|x^{k-1},y^{k-1})=0$, then

\frac{F(x_{k}|x^{k-1},y^{k-1})}{F(\mathcal{A}_{x^{k-1}}|x^{k-1},y^{k-1})} = 0,

which is justified by (25).

A similar derivation gives

P_{t}((x^{n},y^{n})) \geq P((x^{n},y^{n}))

for all $(x^n,y^n)$ such that $c(x^n)\leq\Gamma$ and $t(x^n)=t$, where $T^n_{\mathcal{A}}(t)\subset\mathcal{P}_n^{\gamma}$.

Let $\rho=\exp\left(nC(\Gamma)+\sqrt{n}r\right)$, where $r$ will be specified later. Define (this proof follows that of [4, Thm. 3] and corrects an error therein: the $\mathcal{F}_i$ filtration defined before (98) there should be defined like $\mathcal{G}_i$ here)

\mathcal{G}_{i} := \sigma(X_{1},\ldots,X_{i+1},Y_{1},\ldots,Y_{i})
Z_{i} := i(X_{i},Y_{i})-\mathbb{E}\left[i(X_{i},Y_{i})|\mathcal{G}_{i-1}\right]
\mathcal{F}_{i} := \sigma(Z_{1},Z_{2},\ldots,Z_{i}).

Two things are important to note here. First, by the Markov property $(X^{i-1},Y^{i-1})-X_i-Y_i$, we have $\mathbb{E}\left[i(X_i,Y_i)|\mathcal{G}_{i-1}\right]=\mathbb{E}\left[i(X_i,Y_i)|X_i\right]$ a.s. Second, $\mathcal{F}_i\subset\mathcal{G}_i$.

We can upper bound the first term in (24) as follows:

P_{\gamma}\left(\frac{W(Y^{n}|X^{n})}{q(Y^{n})}>\rho\right)
\leq P_{\gamma}\left(\prod_{i=1}^{n}\frac{W(Y_{i}|X_{i})}{Q^{*}(Y_{i})}>\frac{\rho}{2}\right)
= P_{\gamma}\left(\sum_{i=1}^{n}\left[\log\left(\frac{W(Y_{i}|X_{i})}{Q^{*}(Y_{i})}\right)-C(\Gamma)\right]>\sqrt{n}r-\log 2\right)
\stackrel{(a)}{\leq} P_{\gamma}\left(\sum_{i=1}^{n}\left[i(X_{i},Y_{i})-\mathbb{E}\left[i(X_{i},Y_{i})|\mathcal{G}_{i-1}\right]\right]>\sqrt{n}r-\log 2\right)
= P_{\gamma}\left(\sum_{i=1}^{n}Z_{i}>\sqrt{n}r-\log 2\right). \qquad (26)

In inequality $(a)$, we used the following lemma and the fact that $c(X^n)\leq\Gamma$ almost surely.

Lemma 2

For $\Gamma\in(\Gamma_0,\Gamma^*)$,

\mathbb{E}\left[i(X,Y)|X\right] \leq C(\Gamma)-C'(\Gamma)\left(\Gamma-c(X)\right) \qquad (27)

almost surely, where $X$ has an arbitrary distribution and $Y$ is the output of the channel $W$ when $X$ is the input.

Proof: See [24, Proposition 1] and its references.

We will now apply a martingale central limit theorem [25, Corollary to Theorem 2] to the expression in (26). We first verify that the hypotheses of [25, Corollary to Theorem 2] are satisfied:

  1. First, we require that

    \max_{1\leq k\leq n}|Z_{k}| < \infty.

    Since $Q^*(b)>0$ for all $b\in\mathcal{B}$ by assumption and $W(Y_k|X_k)>0$ almost surely for each channel input and output pair $(X_k,Y_k)$, we have

    |Z_{k}| \leq \max_{a\in\mathcal{A},\,b\in\mathcal{B}:\,W(b|a)>0}2\,|i(a,b)| =: 2i_{\max} < \infty

    for all $1\leq k\leq n$.

  2. Second, we require that

    \mathbb{E}\left[Z_{k}|\mathcal{F}_{k-1}\right] = 0

    almost surely for all $1\leq k\leq n$ [25, p. 672]. This is true because $\mathbb{E}\left[Z_k|\mathcal{G}_{k-1}\right]=0$ implies

    \mathbb{E}\left[\mathbb{E}\left[Z_{k}|\mathcal{G}_{k-1}\right]|\mathcal{F}_{k-1}\right] = 0, \quad\text{i.e.,}\quad \mathbb{E}\left[Z_{k}|\mathcal{F}_{k-1}\right] = 0.

Under the above two conditions, it follows from [25, Corollary to Theorem 2] that there exists a constant $\kappa>0$ depending only on $i_{\max}$ such that for any $s\in\mathbb{R}$,

P_{\gamma}\left(\frac{1}{\sqrt{\sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[Z_{i}^{2}\right]}}\sum_{i=1}^{n}Z_{i}\leq s\right) \geq \Phi(s)-\kappa\left[\frac{n\log n}{\left(\sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[Z_{i}^{2}\right]\right)^{\frac{3}{2}}}+\left\|\frac{\sum_{i=1}^{n}\mathbb{E}_{\gamma}[Z_{i}^{2}|\mathcal{F}_{i-1}]}{\sum_{i=1}^{n}\mathbb{E}_{\gamma}[Z_{i}^{2}]}-1\right\|_{\infty}^{1/2}\right]. \qquad (28)

Using Lemma 3 in (28), we obtain

P_{\gamma}\left(\frac{1}{\sqrt{\sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[Z_{i}^{2}\right]}}\sum_{i=1}^{n}Z_{i}\leq s\right) \geq \Phi(s)-\kappa\left[\frac{n\log n}{\left(nV(\Gamma)-n|\mathcal{A}|\gamma\nu_{\max}\right)^{\frac{3}{2}}}+\left(\frac{4|\mathcal{A}|\gamma\nu_{\max}}{V(\Gamma)}\right)^{1/2}\right] \geq \Phi(s)-\beta_{\gamma}, \qquad (29)

where the last inequality holds for sufficiently large $n$ for some constant $\beta_\gamma>0$ which can be chosen such that $\beta_\gamma\to 0$ as $\gamma\to 0$.

Lemma 3

We have

V(\Gamma)-2\gamma\nu_{\max} \leq \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[Z_{i}^{2}\right] \leq V(\Gamma)+2\gamma\nu_{\max}.

Furthermore, for $\gamma\leq\frac{V(\Gamma)}{4\nu_{\max}}$,

\left\|\frac{\sum_{i=1}^{n}\mathbb{E}_{\gamma}[Z_{i}^{2}|\mathcal{F}_{i-1}]}{\sum_{i=1}^{n}\mathbb{E}_{\gamma}[Z_{i}^{2}]}-1\right\|_{\infty} \leq \frac{8\gamma\nu_{\max}}{V(\Gamma)}

almost surely according to the probability measure $P_\gamma$.

Proof: The proof of Lemma 3 is given in Appendix A.

Using the result in (29) and Lemma 3 in the expression in (26), we obtain

P_{\gamma}\left(\sum_{i=1}^{n}Z_{i}>\sqrt{n}r-\log 2\right)
= P_{\gamma}\left(\frac{1}{\sqrt{\sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[Z_{i}^{2}\right]}}\sum_{i=1}^{n}Z_{i}>\frac{\sqrt{n}r-\log 2}{\sqrt{\sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[Z_{i}^{2}\right]}}\right)
= 1-P_{\gamma}\left(\frac{1}{\sqrt{\sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[Z_{i}^{2}\right]}}\sum_{i=1}^{n}Z_{i}\leq\frac{\sqrt{n}r-\log 2}{\sqrt{\sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[Z_{i}^{2}\right]}}\right)
\leq \begin{cases}1-P_{\gamma}\left(\frac{1}{\sqrt{\sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[Z_{i}^{2}\right]}}\sum_{i=1}^{n}Z_{i}\leq\frac{r-\frac{\log 2}{\sqrt{n}}}{\sqrt{V(\Gamma)+|\mathcal{A}|\gamma\nu_{\max}}}\right) & \text{if }r\geq\frac{\log 2}{\sqrt{n}}\\ 1-P_{\gamma}\left(\frac{1}{\sqrt{\sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[Z_{i}^{2}\right]}}\sum_{i=1}^{n}Z_{i}\leq\frac{r-\frac{\log 2}{\sqrt{n}}}{\sqrt{V(\Gamma)-|\mathcal{A}|\gamma\nu_{\max}}}\right) & \text{if }r<\frac{\log 2}{\sqrt{n}}\end{cases}
\leq \begin{cases}1-\Phi\left(\frac{r-\frac{\log 2}{\sqrt{n}}}{\sqrt{V(\Gamma)+|\mathcal{A}|\gamma\nu_{\max}}}\right)+\beta_{\gamma} & \text{if }r\geq\frac{\log 2}{\sqrt{n}}\\ 1-\Phi\left(\frac{r-\frac{\log 2}{\sqrt{n}}}{\sqrt{V(\Gamma)-|\mathcal{A}|\gamma\nu_{\max}}}\right)+\beta_{\gamma} & \text{if }r<\frac{\log 2}{\sqrt{n}}.\end{cases} \qquad (30)

Let

r = \begin{cases}\sqrt{V(\Gamma)+|\mathcal{A}|\gamma\nu_{\max}}\,\Phi^{-1}\left(\epsilon+3\beta_{\gamma}\right)+\frac{\log 2}{\sqrt{n}} & \text{if }\epsilon\in\left[\frac{1}{2}-3\beta_{\gamma},1\right)\\ \sqrt{V(\Gamma)-|\mathcal{A}|\gamma\nu_{\max}}\,\Phi^{-1}\left(\epsilon+3\beta_{\gamma}\right)+\frac{\log 2}{\sqrt{n}} & \text{if }\epsilon\in\left(0,\frac{1}{2}-3\beta_{\gamma}\right).\end{cases} \qquad (31)

Note that the upper bound in (30) holds for any $r$. For a given error probability $\epsilon\in(0,1)$, we choose $r$ according to (31). Then using (31) in (30), we obtain for any given $\epsilon\in(0,1)$ that

P_{\gamma}\left(\sum_{i=1}^{n}Z_{i}>\sqrt{n}r-\log 2\right) \leq 1-\epsilon-2\beta_{\gamma}. \qquad (32)

Inequality (32) provides an upper bound on the first term in (24).

We now upper bound the second term in (24).

Using again the choice of $q$ in (23), we have

\sum_{t:T^{n}_{\mathcal{A}}(t)\subset\mathcal{P}_{n}^{\gamma}}P_{t}\left(\frac{W(Y^{n}|X^{n})}{q(Y^{n})}>\rho\right)
\leq \sum_{t:T^{n}_{\mathcal{A}}(t)\subset\mathcal{P}_{n}^{\gamma}}P_{t}\left(\frac{W(Y^{n}|X^{n})}{\prod_{i=1}^{n}q_{t}(Y_{i})}>\frac{\rho}{2|\mathcal{P}_{n}(\mathcal{A})|}\right)
\leq \sum_{t:T^{n}_{\mathcal{A}}(t)\subset\mathcal{P}_{n}^{\gamma}}P_{t}\left(\sum_{i=1}^{n}\log\frac{W(Y_{i}|X_{i})}{q_{t}(Y_{i})}>nC(\Gamma)+\sqrt{n}r-\log 2(n+1)^{|\mathcal{A}|}\right)
\stackrel{(a)}{=} \sum_{t:T^{n}_{\mathcal{A}}(t)\subset\mathcal{P}_{n}^{\gamma}}W\left(\sum_{i=1}^{n}\log\frac{W(Y_{i}|x_{t,i})}{q_{t}(Y_{i})}>nC(\Gamma)+\sqrt{n}r-\log 2(n+1)^{|\mathcal{A}|}\right), \qquad (33)

where in equality $(a)$, $(x_{t,1},\ldots,x_{t,n})$ is an arbitrary sequence from the type class $T^n_{\mathcal{A}}(t)$. Equality $(a)$ holds because under the probability measure $P_t$, $t(X^n)=t$ a.s. (see Remark 2) and the distribution of

\sum_{i=1}^{n}\log\frac{W(Y_{i}|X_{i})}{q_{t}(Y_{i})}

depends on $X^n$ only through its type. Continuing from (33), we have

\sum_{t:T^{n}_{\mathcal{A}}(t)\subset\mathcal{P}_{n}^{\gamma}}P_{t}\left(\frac{W(Y^{n}|X^{n})}{q(Y^{n})}>\rho\right)
\leq \sum_{t:T^{n}_{\mathcal{A}}(t)\subset\mathcal{P}_{n}^{\gamma}}W\left(\sum_{i=1}^{n}\left[\log\frac{W(Y_{i}|x_{t,i})}{q_{t}(Y_{i})}-\mathbb{E}_{W}\left[\log\frac{W(Y|x_{t,i})}{q_{t}(Y)}\right]\right]>n\left[C(\Gamma)-I(t,W)\right]+\sqrt{n}r-\log 2(n+1)^{|\mathcal{A}|}\right)
\leq \sum_{t:T^{n}_{\mathcal{A}}(t)\subset\mathcal{P}_{n}^{\gamma}}W\left(\sum_{i=1}^{n}\left[\log\frac{W(Y_{i}|x_{t,i})}{q_{t}(Y_{i})}-\mathbb{E}_{W}\left[\log\frac{W(Y|x_{t,i})}{q_{t}(Y)}\right]\right]>\frac{nK}{2}\right), \qquad (34)

where the last inequality holds for sufficiently large $n$ because $r$, as defined in (31), is an $O(1)$ term, and from the construction of the set $\mathcal{P}_n^{\gamma}$, we have

\inf_{t:T^{n}_{\mathcal{A}}(t)\subset\mathcal{P}_{n}^{\gamma}}d_{W}(t) \geq \gamma > 0,

which implies

\inf_{t:T^{n}_{\mathcal{A}}(t)\subset\mathcal{P}_{n}^{\gamma}}\left[C(\Gamma)-I(t,W)\right] > K

for some constant $K>0$.

Let $i_{\max,t}:=\max_{a,b:\,q_t(b)W(b|a)>0}\left|\log\frac{W(b|a)}{q_t(b)}\right|$. We now show that $i_{\max,t}\leq 2\log n$ for all $t$. Let $W_{\min}:=\min_{a,b:\,W(b|a)>0}W(b|a)$ and $q_{\min,t}:=\min_{b:\,q_t(b)>0}q_t(b)$. Then

q_{\min,t} = \min_{b:\,q_{t}(b)>0}\sum_{a\in\mathcal{A}}t(a)\,W(b|a) \geq \min_{a,b:\,W(b|a)>0}W(b|a)\,\min_{a:\,t(a)>0}t(a) \geq \frac{W_{\min}}{n},
where the final step uses $\min_{a:\,t(a)>0}t(a)\geq 1/n$ for an $n$-type $t$.

Thus,

i_{\max,t} = \max_{a,b:\,q_{t}(b)W(b|a)>0}\left|\log\frac{W(b|a)}{q_{t}(b)}\right| \leq \max_{a,b:\,q_{t}(b)W(b|a)>0}\left|\log W(b|a)\right|+\max_{b:\,q_{t}(b)>0}\left|\log q_{t}(b)\right| \leq \log\frac{n}{W_{\min}^{2}} \leq 2\log n

for all sufficiently large $n$. Hence, we can use Azuma's inequality [26, (33), p. 61] to upper bound (34), giving us

\sum_{t:T^{n}_{\mathcal{A}}(t)\subset\mathcal{P}_{n}^{\gamma}}P_{t}\left(\frac{W(Y^{n}|X^{n})}{q(Y^{n})}>\rho\right) \leq (n+1)^{|\mathcal{A}|}\exp\left(-\frac{nK^{2}}{128\log^{2}n}\right), \qquad (35)

which goes to zero as $n\to\infty$.
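For intuition on the "sufficiently large $n$" qualifier, the bound (35) can be evaluated in the log-domain (the values of $|\mathcal{A}|$ and $K$ below are illustrative assumptions): the polynomial factor $(n+1)^{|\mathcal{A}|}$ dominates at moderate blocklengths, after which the exponential term takes over.

```python
import numpy as np

A_size, K = 2, 1.0  # illustrative assumptions
for n in (10**5, 10**6, 10**7, 10**8):
    log_bound = A_size * np.log(n + 1) - n * K**2 / (128 * np.log(n) ** 2)
    print(f"n = {n:>9d}:  log of bound (35) = {log_bound:+.1f}")
```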

Substituting the upper bounds (32) and (35) in (24), we obtain

(F\circ W)\left(\frac{W(Y^{n}|X^{n})}{q(Y^{n})}>\exp\left(nC(\Gamma)+\sqrt{n}r\right)\right) \leq 1-\epsilon-\beta_{\gamma}

for sufficiently large $n$. Since the controller $F$ was arbitrary, we can apply Lemma 1 to obtain

\log M^{*}_{\text{fb}}(n,\epsilon,\Gamma) \leq nC(\Gamma)+\sqrt{n}r-\log\beta_{\gamma}
\frac{\log M^{*}_{\text{fb}}(n,\epsilon,\Gamma)-nC(\Gamma)}{\sqrt{n}} \leq r-\frac{\log\beta_{\gamma}}{\sqrt{n}}
\limsup_{n\to\infty}\frac{\log M^{*}_{\text{fb}}(n,\epsilon,\Gamma)-nC(\Gamma)}{\sqrt{n}} \leq r',

where $r'$ is obtained from the expression for $r$ in (31) after taking the limit as $n\to\infty$, i.e.,

r' = \begin{cases}\sqrt{V(\Gamma)+|\mathcal{A}|\gamma\nu_{\max}}\,\Phi^{-1}\left(\epsilon+3\beta_{\gamma}\right) & \text{if }\epsilon\in\left[\frac{1}{2}-3\beta_{\gamma},1\right)\\ \sqrt{V(\Gamma)-|\mathcal{A}|\gamma\nu_{\max}}\,\Phi^{-1}\left(\epsilon+3\beta_{\gamma}\right) & \text{if }\epsilon\in\left(0,\frac{1}{2}-3\beta_{\gamma}\right).\end{cases}

Finally, since $\frac{V(\Gamma)}{4|\mathcal{A}|\nu_{\max}}>\gamma>0$ was arbitrary, we can take $\gamma$ and $\beta_\gamma$ arbitrarily small, giving us the converse result

\limsup_{n\to\infty}\frac{\log M^{*}_{\text{fb}}(n,\epsilon,\Gamma)-nC(\Gamma)}{\sqrt{n}} \leq \sqrt{V(\Gamma)}\,\Phi^{-1}(\epsilon).

Since this matches the optimal non-feedback SOCR of simple-dispersion DMCs with a peak-power cost constraint, we have

\lim_{n\to\infty}\frac{\log M^{*}_{\text{fb}}(n,\epsilon,\Gamma)-nC(\Gamma)}{\sqrt{n}} = \sqrt{V(\Gamma)}\,\Phi^{-1}(\epsilon) \qquad (36)

for simple-dispersion DMCs with a peak-power cost constraint.

Appendix A Proof of Lemma 3

We have

\sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[Z_{i}^{2}|\mathcal{G}_{i-1}\right]
= \sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[\left(i(X_{i},Y_{i})-\mathbb{E}\left[i(X_{i},Y_{i})|X_{i}\right]\right)^{2}|\mathcal{G}_{i-1}\right]
= \sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[\left(i(X_{i},Y_{i})-\mathbb{E}\left[i(X_{i},Y_{i})|X_{i}\right]\right)^{2}|X_{i}\right]
= \sum_{i=1}^{n}\text{Var}\left(i(X_{i},Y_{i})|X_{i}\right)
= \sum_{i=1}^{n}\sum_{a\in\mathcal{A}}\mathds{1}\left(X_{i}=a\right)\nu_{a}
= n\sum_{a\in\mathcal{A}}P_{X^{n}}(a)\,\nu_{a}.

From Remark 1, since $d_W(t(X^n))\leq\gamma$ a.s., there exists a $\tilde{P}\in\Pi^*_{W,\Gamma}$ such that $\|t(X^n)-\tilde{P}\|_1\leq 2\gamma$. Hence,

nV(\Gamma)-2n\gamma\nu_{\max} \leq \sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[Z_{i}^{2}|\mathcal{G}_{i-1}\right] \leq nV(\Gamma)+2n\gamma\nu_{\max}

a.s., where we used the fact that $W$ is simple-dispersion at the cost $\Gamma$. Furthermore, since $\mathcal{F}_{i-1}\subset\mathcal{G}_{i-1}$,

nV(\Gamma)-2n\gamma\nu_{\max} \leq \sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[Z_{i}^{2}|\mathcal{F}_{i-1}\right] \leq nV(\Gamma)+2n\gamma\nu_{\max}

a.s. and

nV(\Gamma)-2n\gamma\nu_{\max} \leq \sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[Z_{i}^{2}\right] \leq nV(\Gamma)+2n\gamma\nu_{\max}.

Finally, we have

\left|\frac{\sum_{i=1}^{n}\mathbb{E}_{\gamma}[Z_{i}^{2}|\mathcal{F}_{i-1}]}{\sum_{i=1}^{n}\mathbb{E}_{\gamma}[Z_{i}^{2}]}-1\right| \leq \left|\frac{V(\Gamma)+2\gamma\nu_{\max}}{V(\Gamma)-2\gamma\nu_{\max}}-1\right| = \frac{4\gamma\nu_{\max}}{V(\Gamma)-2\gamma\nu_{\max}} \leq \frac{8\gamma\nu_{\max}}{V(\Gamma)},

assuming $\gamma\leq\frac{V(\Gamma)}{4\nu_{\max}}$.

Acknowledgment

This research was supported by the US National Science Foundation under grant CCF-1956192.

References

  • [1] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems, 2nd ed.   Cambridge University Press, 2011.
  • [2] V. Strassen, “Asymptotische Abschätzungen in Shannons Informationstheorie,” in Trans. 3rd Prague Conf. Inf. Theory, Prague, Czechoslovakia, 1962, pp. 689–723.
  • [3] M. Hayashi, “Information spectrum approach to second-order coding rate in channel coding,” IEEE Transactions on Information Theory, vol. 55, no. 11, pp. 4947–4966, 2009.
  • [4] A. B. Wagner, N. V. Shende, and Y. Altuğ, “A new method for employing feedback to improve coding performance,” IEEE Transactions on Information Theory, vol. 66, no. 11, pp. 6660–6681, 2020.
  • [5] Y. Y. Shkel, V. Y. F. Tan, and S. C. Draper, “Second-order coding rate for $m$-class source-channel codes,” in 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2015, pp. 620–626.
  • [6] M. Tomamichel and V. Y. F. Tan, “Second-order coding rates for channels with state,” IEEE Transactions on Information Theory, vol. 60, no. 8, pp. 4427–4448, 2014.
  • [7] C. E. Shannon, “Probability of error for optimal codes in a Gaussian channel,” The Bell System Technical Journal, vol. 38, no. 3, pp. 611–656, 1959.
  • [8] A. Mahmood and A. B. Wagner, “Channel coding with mean and variance cost constraints,” in 2024 IEEE International Symposium on Information Theory (ISIT), 2024, pp. 510–515.
  • [9] R. Yates, “A framework for uplink power control in cellular radio systems,” IEEE Journal on Selected Areas in Communications, vol. 13, no. 7, pp. 1341–1347, 1995.
  • [10] A. Goldsmith and P. Varaiya, “Capacity of fading channels with channel side information,” IEEE Transactions on Information Theory, vol. 43, no. 6, pp. 1986–1992, 1997.
  • [11] S. Hanly and D. Tse, “Multiaccess fading channels. II. Delay-limited capacities,” IEEE Transactions on Information Theory, vol. 44, no. 7, pp. 2816–2831, 1998.
  • [12] A. Lozano and N. Jindal, “Transmit diversity vs. spatial multiplexing in modern MIMO systems,” IEEE Transactions on Wireless Communications, vol. 9, no. 1, pp. 186–197, 2010.
  • [13] Y. Polyanskiy, “Channel coding: Non-asymptotic fundamental limits,” Ph.D. dissertation, Dept. Elect. Eng., Princeton Univ., Princeton, NJ, USA, 2010.
  • [14] A. Sahai, S. Draper, and M. Gastpar, “Boosting reliability over AWGN networks with average power constraints and noiseless feedback,” in Proceedings. International Symposium on Information Theory, 2005. ISIT 2005., 2005, pp. 402–406.
  • [15] S. L. Fong and V. Y. F. Tan, “Asymptotic expansions for the AWGN channel with feedback under a peak power constraint,” in 2015 IEEE International Symposium on Information Theory (ISIT), 2015, pp. 311–315.
  • [16] A. Mahmood and A. B. Wagner, “Timid/bold coding for channels with cost constraints,” in 2023 IEEE International Symposium on Information Theory (ISIT), 2023, pp. 1442–1447.
  • [17] V. Kostina and S. Verdú, “Channels with cost constraints: Strong converse and dispersion,” IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2415–2429, 2015.
  • [18] S. L. Fong and V. Y. F. Tan, “A tight upper bound on the second-order coding rate of the parallel Gaussian channel with feedback,” IEEE Transactions on Information Theory, vol. 63, no. 10, pp. 6474–6486, 2017.
  • [19] W. Yang, G. Caire, G. Durisi, and Y. Polyanskiy, “Optimum power control at finite blocklength,” IEEE Transactions on Information Theory, vol. 61, no. 9, pp. 4598–4615, 2015.
  • [20] V. Y. F. Tan and M. Tomamichel, “The third-order term in the normal approximation for the AWGN channel,” IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2430–2438, 2015.
  • [21] R. G. Gallager, Information Theory and Reliable Communication.   New York, NY, USA: Wiley, 1968.
  • [22] C. Shannon, “The zero error capacity of a noisy channel,” IRE Transactions on Information Theory, vol. 2, no. 3, pp. 8–19, 1956.
  • [23] Y. Polyanskiy, H. V. Poor, and S. Verdú, “Channel coding rate in the finite blocklength regime,” IEEE Transactions on Information Theory, vol. 56, no. 5, pp. 2307–2359, 2010.
  • [24] A. Mahmood and A. B. Wagner, “Channel coding with mean and variance cost constraints,” 2024. [Online]. Available: https://arxiv.org/abs/2401.16417
  • [25] E. Bolthausen, “Exact convergence rates in some martingale central limit theorems,” The Annals of Probability, vol. 10, no. 3, pp. 672–688, Aug. 1982.
  • [26] B. Bercu, B. Delyon, and E. Rio, Concentration Inequalities for Sums and Martingales.   Cham, Switzerland: Springer, 2015.