
Improved Channel Coding Performance Through Cost Variability

Adeel Mahmood and Aaron B. Wagner
School of Electrical and Computer Engineering, Cornell University
Abstract

Channel coding for discrete memoryless channels (DMCs) with mean and variance cost constraints has been recently introduced. We show that there is an improvement in coding performance due to cost variability, both with and without feedback. We demonstrate this improvement over the traditional almost-sure cost constraint (also called the peak-power constraint) that prohibits any cost variation above a fixed threshold. Our result simultaneously shows that feedback does not improve the second-order coding rate of simple-dispersion DMCs under the peak-power constraint. This finding parallels similar results for unconstrained simple-dispersion DMCs, additive white Gaussian noise (AWGN) channels and parallel Gaussian channels.

Index Terms:
Channel coding, feedback communications, second-order coding rate, stochastic control.

I Introduction

Channel coding is a fundamental problem focused on the reliable transmission of information over a noisy channel. Information transmission with arbitrarily small error probability is possible at all rates below the capacity $C$ of the channel, if the number $n$ of channel uses (also called the blocklength) is permitted to grow without bound [1]. At finite blocklengths, there is an unavoidable backoff from capacity due to the random nature of the channel. The second-order coding rate (SOCR) [2, 3, 4, 5, 6] quantifies the $O(n^{-1/2})$ convergence to capacity.

In many practical scenarios, the channel input is subject to cost constraints which limit the amount of resources that can be used for transmission. With a cost constraint present, the role of capacity is replaced by the capacity-cost function [1, Theorem 6.11]. One common form of cost constraint is the almost-sure (a.s.) cost constraint [3, 7], which bounds the time-average cost of the channel input $X^n$ over all messages, realizations of any side randomness, channel noise (if there is feedback), etc.:

\frac{1}{n}\sum_{i=1}^{n}c(X_{i}) \leq \Gamma \quad\text{almost surely,} \qquad (1)

where $c(\cdot)$ is the cost function. Under the almost-sure (a.s.) cost constraint, the optimal first-order coding rate is the capacity-cost function, the strong converse holds [1, Theorem 6.11], and the optimal SOCR is also known [3, Theorem 3].

The a.s. cost constraint is quite unforgiving, never allowing the cost to exceed the threshold under any circumstances. Our first result (Theorem 1) shows that the SOCR can be strictly improved by merely allowing the cost to fluctuate above the threshold in a manner consistent with a noise process, i.e., the fluctuations have a variance of $O(1/n)$. Our second result (Theorem 2) shows that the a.s. cost framework does not allow feedback to improve the SOCR of simple-dispersion DMCs (see Definition 1). This again contrasts with the scenario where random fluctuations with variance $O(1/n)$ above the threshold are allowed, as shown in [8, Theorem 3]. This highlights the important role cost variability plays in enabling feedback mechanisms to improve coding performance.

These findings raise the question of whether it is necessary to impose a constraint as stringent as (1). Cost constraints in communication systems are typically imposed to achieve goals such as operating circuitry in the linear regime, minimizing power consumption, and reducing interference with other terminals. It is worth noting that these goals do not always necessitate the use of the strict a.s. cost constraint. For example, the expected cost constraint is often used in wireless communication literature (see, e.g., [9, 10, 11, 12]) because it allows for a dynamic allocation of power based on the current channel state. The expected cost constraint bounds the cost averaged over time and the ensemble:

\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}c(X_{i})\right] \leq \Gamma. \qquad (2)

Yet, if the a.s. constraint is too strict, the expectation constraint is arguably too weak. The expectation constraint allows highly non-ergodic use of power, as shown in Section II-A, which is problematic both from the vantage points of operating circuitry in the linear regime and interference management.

The $O(1/n)$ variance allowance is a feature of a new cost formulation, referred to as mean and variance cost constraints in [8]. This formulation replaces (1) with the following conditions:

\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}c(X_{i})\right] \leq \Gamma, \qquad (3)
\text{Var}\left(\frac{1}{n}\sum_{i=1}^{n}c(X_{i})\right) \leq \frac{V}{n}. \qquad (4)

The mean and variance cost constraints were introduced as a relaxed version of the a.s. cost constraint that permits a small amount of stochastic fluctuation above the threshold $\Gamma$ while providing an ergodicity guarantee. Consider a random channel codebook whose codewords satisfy (3) with equality. For a given input $x^n$, define an ergodicity metric $\mathcal{E}_m$ as

\mathcal{E}_{m}(x^{n}) := \frac{\max\left(\frac{1}{n}\sum_{i=1}^{n}c(x_{i})-\Gamma,\,0\right)}{\Gamma}. \qquad (5)

The definition in (5) only penalizes cost variation above the threshold and normalizes by the mean cost $\Gamma$. Let $\alpha>0$ be the desired ergodicity parameter. We say that a transmission $x^n$ is $\alpha$-ergodic if $\mathcal{E}_m(x^n)\leq\alpha$. Let $\beta$ be the desired uncertainty parameter. We say that a random codebook is $(\alpha,\beta)$-ergodic if $\mathbb{P}\left(\mathcal{E}_m(X^n)\leq\alpha\right)\geq 1-\beta$, where $X^n$ is a random transmission from the codebook.
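To make these definitions concrete, here is a minimal Python sketch (our own illustration; the binary cost function $c(x)=x$ and the parameter values are assumptions chosen for this example) that evaluates the ergodicity metric $\mathcal{E}_m$ of (5) for a sample transmission and checks $\alpha$-ergodicity:

```python
import numpy as np

def ergodicity_metric(x, cost, Gamma):
    """E_m(x^n) of (5): cost overshoot above Gamma, normalized by Gamma."""
    avg_cost = np.mean([cost(xi) for xi in x])
    return max(avg_cost - Gamma, 0.0) / Gamma

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.22, size=1000)  # a transmission whose cost slightly overshoots
alpha, Gamma = 0.1, 0.2
m = ergodicity_metric(x, cost=lambda a: a, Gamma=Gamma)
print(f"E_m = {m:.4f}, alpha-ergodic: {m <= alpha}")
```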

Under the mean and variance cost formulation, we have $\mathbb{P}\left(\mathcal{E}_m(X^n)\leq\alpha\right)\geq 1-\beta$ if

n \geq n_{c} := \frac{V}{\beta\alpha^{2}\Gamma^{2}}, \qquad (6)

where we call $n_c$ the critical blocklength. Thus, the critical blocklength specifies the minimum blocklength of a channel code for which transmission behaves ergodically with high probability. For fixed $\alpha$, $\beta$, and $\Gamma$, the parameters $n_c$ and $V$ are in one-to-one correspondence, so one can view the choice of $V$ in (4) as specifying the critical blocklength (a numerical sketch of this relation is given after the list below). Note that with an expectation-only constraint, we effectively have $V=\infty$, so the transmission is not guaranteed to be ergodic at any blocklength. Furthermore, unlike the expected cost constraint, the mean and variance cost formulation:

  • allows for a strong converse [13, Theorem 77], [8],

  • allows for a finite second-order coding rate [8],

  • does not allow blasting power on errors in the feedback case [14].
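The following short computation (parameter values are illustrative assumptions) evaluates the critical blocklength $n_c$ of (6), together with the inverse relation $V=n\beta\alpha^2\Gamma^2$ used later in Section III-A:

```python
def critical_blocklength(V, alpha, beta, Gamma):
    """n_c = V / (beta * alpha^2 * Gamma^2) from (6)."""
    return V / (beta * alpha**2 * Gamma**2)

def variance_budget(n, alpha, beta, Gamma):
    """The inverse relation V = n * beta * alpha^2 * Gamma^2."""
    return n * beta * alpha**2 * Gamma**2

alpha, beta, Gamma = 0.1, 0.1, 0.2
for V in (1e-4, 1e-3, 1e-2):
    print(f"V = {V:.0e}  ->  n_c = {critical_blocklength(V, alpha, beta, Gamma):.1f}")
print(f"n = 1000  ->  V = {variance_budget(1000, alpha, beta, Gamma):.2e}")
```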

The results of this paper are also significant in the context of prior work. Our Theorem 2 extends the previously known result that feedback does not improve the second-order performance of simple-dispersion DMCs without cost constraints [4]. It also parallels the result in [15] that feedback does not improve the second-order performance of AWGN channels.

Random channel coding schemes often use independent and identically distributed (i.i.d.) codewords. It was noted in [16] that the a.s. cost constraint, which is the most commonly considered cost constraint in the context of discrete memoryless channels (DMCs), prohibits the use of i.i.d. codewords. It was shown in [16] that a feedback scheme that uses both i.i.d. and constant-composition codewords leads to an improved SOCR compared to the best non-feedback SOCR achievable under the a.s. cost constraint. Our result in Theorem 2 strengthens the result in [16] by showing that the aforementioned improvement also holds compared to the best feedback SOCR achievable under the a.s. cost constraint.

I-A Related Work

The second- and third-order asymptotics for DMCs with the a.s. cost constraint in the non-feedback setting have been characterized in [3] and [17], respectively. The second-order asymptotics in the feedback setting of DMCs that are not simple-dispersion are studied in [4] without cost constraints. There are more feedback results available for AWGN channels compared to DMCs under the a.s. cost constraint. For example, the result in [15] also addresses the third-order performance with feedback while [18] gives the result that feedback does not improve the second-order performance for parallel Gaussian channels. The second-order performance for the AWGN channel with an expected cost constraint is characterized in [19]. Table I summarizes these results across different settings in channel coding.

| Paper | Channel | Performance | Cost Constraint | Feedback | Non-feedback |
| --- | --- | --- | --- | --- | --- |
| Hayashi [3] | DMC, AWGN | 2nd order | a.s. | No | Yes |
| Tan and Tomamichel [20] | AWGN | 3rd order | a.s. | No | Yes |
| Kostina and Verdú [17] | DMC | 3rd order | a.s. | No | Yes |
| Fong and Tan [15] | AWGN | 2nd and 3rd order | a.s. | Yes | No |
| Wagner, Shende and Altuğ [4] | DMC | 2nd order | none | Yes | No |
| Mahmood and Wagner [8] | DMC | 2nd order | mean and variance | Yes | Yes |
| This paper | DMC | 2nd order | mean and variance, a.s. | Yes | Yes |
| Polyanskiy [13, Th. 78] | Parallel AWGN | 2nd order | a.s. | No | Yes |
| Fong and Tan [18] | Parallel AWGN | 2nd order | a.s. | Yes | No |
| Polyanskiy [13, Th. 77] | AWGN | 1st order | expected cost | No | Yes |
| Yang et al. [19] | AWGN | 2nd order | expected cost | No | Yes |

TABLE I: Relevant results across different settings in channel coding.

Our proof technique for Theorem 2 is more closely aligned with that used in [4] for DMCs than in [15] for AWGN channels. Both proofs show converse bounds with feedback that match the previously known non-feedback achievability results for DMCs and AWGN channels, respectively. A common technique used in both converse proofs is a result from binary hypothesis testing, which is used in the derivation of Lemma 1 in our paper and a similar result in [15, (17)]. We then proceed with the proof by using a Berry-Esseen-type result for bounded martingale difference sequences whereas [15] uses the usual Berry-Esseen theorem by first showing equality in distribution of the information density with a sum of i.i.d. random variables.

II Preliminaries

Let $\mathcal{A}$ and $\mathcal{B}$ be finite input and output alphabets, respectively, of the DMC $W$, where $W$ is a stochastic matrix from $\mathcal{A}$ to $\mathcal{B}$. For a given sequence $x^n\in\mathcal{A}^n$, the $n$-type $t=t(x^n)$ of $x^n$ is defined as

t(a) = \frac{1}{n}\sum_{i=1}^{n}\mathds{1}(x_{i}=a)

for all $a\in\mathcal{A}$, where $\mathds{1}(\cdot)$ is the standard indicator function. For a given sequence $x^n\in\mathcal{A}^n$, we will use $t(x^n)$ or $P_{x^n}$ to denote its type. Let $\mathcal{P}_n(\mathcal{A})$ be the set of $n$-types on $\mathcal{A}$. For a given $t\in\mathcal{P}_n(\mathcal{A})$, $T^n_{\mathcal{A}}(t)$ denotes the type class, i.e., the set of sequences $x^n\in\mathcal{A}^n$ with empirical distribution equal to $t$. For a random variable $Z$, $\|Z\|_\infty$ denotes its essential supremum (that is, the infimum of those numbers $z$ such that $\mathbb{P}(Z\leq z)=1$). We will write $\log$ to denote the logarithm to base $e$ and $\exp(x)$ to denote $e$ to the power of $x$. The cost function is denoted by $c(\cdot)$, where $c:\mathcal{A}\to[0,c_{\max}]$ and $c_{\max}>0$ is a constant. Let $\Gamma_0=\min_{a\in\mathcal{A}}c(a)$. Let $\Gamma^*$ denote the smallest $\Gamma$ such that the capacity-cost function $C(\Gamma)$ is equal to the unconstrained capacity. We assume $\Gamma^*>\Gamma_0$ and $\Gamma\in(\Gamma_0,\Gamma^*)$ throughout the paper. For $\Gamma\in(\Gamma_0,\Gamma^*)$, the capacity-cost function is defined as

C(\Gamma) = \max_{\substack{P\in\mathcal{P}(\mathcal{A}) \\ c(P)\leq\Gamma}} I(P,W), \qquad (7)

where $c(P):=\sum_{a\in\mathcal{A}}P(a)c(a)$. The function $C(\Gamma)$ is strictly increasing and differentiable [1, Problem 8.4] in the interval $(\Gamma_0,\Gamma^*)$. For a given $x^n\in\mathcal{A}^n$, we define

c(x^{n}) := \frac{1}{n}\sum_{i=1}^{n}c(x_{i}).
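As a concrete illustration of this notation (our own sketch, not part of the paper's development), the $n$-type $t(x^n)$ and the time-average cost $c(x^n)$ can be computed as follows:

```python
from collections import Counter

def n_type(x, alphabet):
    """Empirical distribution t(x^n) of the sequence x over the alphabet."""
    counts = Counter(x)
    n = len(x)
    return {a: counts.get(a, 0) / n for a in alphabet}

def avg_cost(x, cost):
    """Time-average cost c(x^n) = (1/n) * sum_i c(x_i)."""
    return sum(cost(xi) for xi in x) / len(x)

x = [0, 1, 1, 0, 1]
print(n_type(x, alphabet=[0, 1]))      # {0: 0.4, 1: 0.6}
print(avg_cost(x, cost=lambda a: a))   # 0.6
```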

Let $\Pi^*_{W,\Gamma}$ be the set of all capacity-cost-achieving distributions, i.e., the set of maximizing distributions in (7). For any $P^*\in\Pi^*_{W,\Gamma}$, let $Q^*=P^*W$ be the marginal distribution on $\mathcal{B}$. Note that the output distribution $Q^*$ is always unique, and without loss of generality, $Q^*$ can be assumed to satisfy $Q^*(b)>0$ for all $b\in\mathcal{B}$ [21, Corollaries 1 and 2 to Theorem 4.5.1].

The following definitions will remain in effect throughout the paper:

\nu_{a} := \text{Var}\left(\log\frac{W(Y|a)}{Q^{*}(Y)}\right), \quad\text{where } Y\sim W(\cdot|a),
\nu_{\max} := \max_{a\in\mathcal{A}}\nu_{a},
i(a,b) := \log\frac{W(b|a)}{Q^{*}(b)}.
Definition 1 (cf. [4])

A DMC $W$ is called simple-dispersion at the cost $\Gamma\in(\Gamma_0,\Gamma^*)$ if

\min_{P^{*}\in\Pi_{W,\Gamma}^{*}}\sum_{a\in\mathcal{A}}P^{*}(a)\nu_{a} = \max_{P^{*}\in\Pi_{W,\Gamma}^{*}}\sum_{a\in\mathcal{A}}P^{*}(a)\nu_{a}.

We will only focus on simple-dispersion channels for a fixed cost $\Gamma\in(\Gamma_0,\Gamma^*)$ and thus define

V(\Gamma) := \sum_{a\in\mathcal{A}}P^{*}(a)\nu_{a}

for any $P^*\in\Pi^*_{W,\Gamma}$.

With a blocklength $n$ and a fixed rate $R>0$, let $\mathcal{M}_R=\{1,\ldots,\lceil\exp(nR)\rceil\}$ denote the message set. Let $M\in\mathcal{M}_R$ denote the random message drawn uniformly from the message set.

Definition 2

An $(n,R)$ code for a DMC consists of an encoder $f$ which, for each message $m\in\mathcal{M}_R$, chooses an input $X^n=f(m)\in\mathcal{A}^n$, and a decoder $g$ which maps the output $Y^n$ to $\hat{m}\in\mathcal{M}_R$. The code $(f,g)$ is random if $f$ or $g$ is random.

Definition 3

An $(n,R)$ code with ideal feedback for a DMC consists of an encoder $f$ which, at each time instant $k$ ($1\leq k\leq n$) and for each message $m\in\mathcal{M}_R$, chooses an input $x_k=f(m,x^{k-1},y^{k-1})\in\mathcal{A}$, and a decoder $g$ which maps the output $y^n$ to $\hat{m}\in\mathcal{M}_R$. The code $(f,g)$ is random if $f$ or $g$ is random.

Definition 4

An $(n,R,\Gamma)$ code for a DMC is an $(n,R)$ code such that $c(X^n)\leq\Gamma$ almost surely, where the message $M\sim\text{Unif}(\mathcal{M}_R)$ has a uniform distribution over the message set $\mathcal{M}_R$.

Definition 5

An $(n,R,\Gamma)$ code with ideal feedback for a DMC is an $(n,R)$ code with ideal feedback such that $c(X^n)\leq\Gamma$ almost surely, where the message $M\sim\text{Unif}(\mathcal{M}_R)$ has a uniform distribution over the message set $\mathcal{M}_R$.

Definition 6

An $(n,R,\Gamma,V)$ code for a DMC is an $(n,R)$ code such that $\mathbb{E}\left[\sum_{i=1}^{n}c(X_i)\right]\leq n\Gamma$ and $\text{Var}\left(\sum_{i=1}^{n}c(X_i)\right)\leq nV$, where the message $M\sim\text{Unif}(\mathcal{M}_R)$ has a uniform distribution over the message set $\mathcal{M}_R$.

Definition 7

An $(n,R,\Gamma,V)$ code with ideal feedback for a DMC is an $(n,R)$ code with ideal feedback such that $\mathbb{E}\left[\sum_{i=1}^{n}c(X_i)\right]\leq n\Gamma$ and $\text{Var}\left(\sum_{i=1}^{n}c(X_i)\right)\leq nV$, where the message $M\sim\text{Unif}(\mathcal{M}_R)$ has a uniform distribution over the message set $\mathcal{M}_R$.

Given $\epsilon\in(0,1)$, define

M^{*}_{\text{fb}}(n,\epsilon,\Gamma) := \max\{\lceil\exp(nR)\rceil : \bar{P}_{\text{e,fb}}(n,R,\Gamma)\leq\epsilon\},

where $\bar{P}_{\text{e,fb}}(n,R,\Gamma)$ denotes the minimum average error probability attainable by any random $(n,R,\Gamma)$ code with feedback. Similarly, define

M^{*}(n,\epsilon,\Gamma) := \max\{\lceil\exp(nR)\rceil : \bar{P}_{\text{e}}(n,R,\Gamma)\leq\epsilon\},

where $\bar{P}_{\text{e}}(n,R,\Gamma)$ denotes the minimum average error probability attainable by any random $(n,R,\Gamma)$ code without feedback. Define $M^*_{\text{fb}}(n,\epsilon,\Gamma,V)$ and $M^*(n,\epsilon,\Gamma,V)$ similarly for codes with mean and variance cost constraints.

II-A Expectation-only cost constraint

Under this cost formulation, the average cost of the codewords is constrained in expectation only:

\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n}c(X_{i})\right] \leq \Gamma. \qquad (8)

We now illustrate a codebook construction (adapted from [17]) with an average error probability at most $\epsilon\in(0,1)$ that meets the cost threshold $\Gamma$ according to (8), but whose codeword cost is non-ergodic, i.e., $\frac{1}{n}\sum_{i=1}^{n}c(X_i)$ does not converge to $\Gamma$. Consider a codebook $\mathcal{C}_n$ with rate $C(\Gamma)<R<C\left(\frac{\Gamma}{1-\epsilon}\right)$ whose average error probability $\epsilon_n\to 0$ and each of whose codewords has average cost equal to $\frac{\Gamma}{1-\epsilon}$. Such a codebook exists because $R<C\left(\frac{\Gamma}{1-\epsilon}\right)$. Assuming $\Gamma_0=\min_{a\in\mathcal{A}}c(a)=0$ without loss of generality, one could modify the codebook $\mathcal{C}_n$ by replacing an $\epsilon$-fraction of its codewords with the all-zero codeword. The modified codebook $\mathcal{C}_n'$ has average error probability at most $\epsilon_n'\to\epsilon$ and meets the cost threshold $\Gamma$ according to (8). But $\frac{1}{n}\sum_{i=1}^{n}c(X_i)$ is either $0$ or $\frac{\Gamma}{1-\epsilon}$. This construction also shows that the strong converse does not hold under the expected cost constraint.

The mean and variance cost constraints ensure that the average cost of the codewords concentrates around the cost threshold $\Gamma$, thereby disallowing codebook constructions with irregular or non-ergodic power consumption, as the following simulation sketch illustrates.
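The sketch below (illustrative values $\Gamma=0.2$ and $\epsilon=0.1$ assumed) mimics the cost profile of the modified codebook $\mathcal{C}_n'$: the expected cost meets the threshold, but the per-codeword time-average cost takes only the two values $0$ and $\Gamma/(1-\epsilon)$, so its variance does not decay with the blocklength:

```python
import numpy as np

rng = np.random.default_rng(1)
Gamma, eps = 0.2, 0.1
# Per-codeword time-average cost: 0 with prob. eps, Gamma/(1-eps) otherwise.
costs = rng.choice([0.0, Gamma / (1 - eps)], p=[eps, 1 - eps], size=100_000)
print(f"mean cost = {costs.mean():.4f}  (threshold Gamma = {Gamma})")
print(f"cost variance = {costs.var():.5f}  (constant in n, violating (4))")
```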

III Main Results

We prove the coding performance improvement in terms of the second-order coding rate, although equivalent results in terms of improvement in the average error probability can also be shown, as in [8, Theorems 1-3]. Let

  • $r_{\text{a.s.}}(\epsilon,\Gamma)$ denote the optimal SOCR with the a.s. cost constraint without feedback,

  • $r_{\text{a.s.,fb}}(\epsilon,\Gamma)$ denote the optimal SOCR with the a.s. cost constraint with feedback,

  • $r_{\text{m.v.}}(\epsilon,\Gamma,V)$ denote the optimal SOCR with the mean and variance cost constraints without feedback, and

  • $r_{\text{m.v.,fb}}(\epsilon,\Gamma,V)$ denote the optimal SOCR with the mean and variance cost constraints with feedback

for channel codes operating with average error probability at most $\epsilon\in(0,1)$. In the non-feedback case, $C(\Gamma)$ is the optimal first-order rate for DMCs with the a.s. cost constraint [1, Theorem 6.11] as well as the mean and variance cost formulation [8, Theorems 1 and 2], i.e.,

\lim_{n\to\infty}\frac{1}{n}\log M^{*}(n,\epsilon,\Gamma) = C(\Gamma) \qquad (9)
\lim_{n\to\infty}\frac{1}{n}\log M^{*}(n,\epsilon,\Gamma,V) = C(\Gamma) \qquad (10)

for all $\epsilon\in(0,1)$. The results (9) and (10) imply that the strong converse holds. We thus define the second-order rates with respect to the capacity-cost function $C(\Gamma)$ as follows:

r_{\text{a.s.}}(\epsilon,\Gamma) := \liminf_{n\to\infty}\frac{\log M^{*}(n,\epsilon,\Gamma)-nC(\Gamma)}{\sqrt{n}}
r_{\text{m.v.}}(\epsilon,\Gamma,V) := \liminf_{n\to\infty}\frac{\log M^{*}(n,\epsilon,\Gamma,V)-nC(\Gamma)}{\sqrt{n}}.

For the feedback case, we simply take the convention to define the SOCR with respect to $C(\Gamma)$ as follows:

r_{\text{a.s.,fb}}(\epsilon,\Gamma) := \liminf_{n\to\infty}\frac{\log M^{*}_{\text{fb}}(n,\epsilon,\Gamma)-nC(\Gamma)}{\sqrt{n}}
r_{\text{m.v.,fb}}(\epsilon,\Gamma,V) := \liminf_{n\to\infty}\frac{\log M^{*}_{\text{fb}}(n,\epsilon,\Gamma,V)-nC(\Gamma)}{\sqrt{n}}.

For the a.s. cost constraint, this convention is justified because, by Theorem 2, $C(\Gamma)$ is the optimal first-order rate, in the sense analogous to (9), for DMCs with feedback. For DMCs without cost constraints, Shannon [22, Theorem 6] showed that feedback does not increase the capacity.

III-A Performance improvement for non-feedback codes

From [3, Theorem 3], we have $r_{\text{a.s.}}(\epsilon,\Gamma)=\sqrt{V(\Gamma)}\,\Phi^{-1}(\epsilon)$ for a simple-dispersion DMC $W$ (the result in [3, Theorem 3] is not restricted to simple-dispersion DMCs). On the other hand, [8, Theorems 1 and 2] proved that

r_{\text{m.v.}}(\epsilon,\Gamma,V) = \max\left\{r\in\mathbb{R} : \mathcal{K}\left(\frac{r}{\sqrt{V(\Gamma)}},\frac{C'(\Gamma)^{2}V}{V(\Gamma)}\right)\leq\epsilon\right\} \qquad (11)

for a DMC $W$ such that $|\Pi^*_{W,\Gamma}|=1$ and $V(\Gamma)>0$, where the function $\mathcal{K}:\mathbb{R}\times(0,\infty)\to(0,1)$ is given by

\mathcal{K}(r,V) = \min_{\substack{\Pi:\; \mathbb{E}[\Pi]=r \\ \text{Var}(\Pi)\leq V \\ |\text{supp}(\Pi)|\leq 3}} \mathbb{E}\left[\Phi(\Pi)\right]. \qquad (12)

The maximum and the minimum in (11) and (12), respectively, are attained [8, Lemmas 3 and 4].
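Since (12) is a low-dimensional optimization, it can be explored numerically. The sketch below (our own illustration, assuming SciPy is available; the parameter values are arbitrary) upper-bounds $\mathcal{K}$ by searching over two-point distributions with the variance constraint met with equality, which in turn yields a lower bound on $r_{\text{m.v.}}$ of (11):

```python
import numpy as np
from scipy.stats import norm

def K_upper(r, V, grid=200):
    """Upper bound on K(r, V) of (12) via a search over two-point laws
    with mean r and variance exactly V (the point mass at r is also feasible)."""
    best = norm.cdf(r)
    for p in np.linspace(1e-4, 0.5, grid):
        pi = r + np.sqrt(V * (1 - p) / p)
        other = (r - p * pi) / (1 - p)      # enforces E[Pi] = r
        best = min(best, p * norm.cdf(pi) + (1 - p) * norm.cdf(other))
    return best

def r_mv_lower(eps, V_Gamma, CpV, r_grid):
    """Largest grid point r with K_upper <= eps; a lower bound on (11)."""
    feasible = [r for r in r_grid
                if K_upper(r / np.sqrt(V_Gamma), CpV / V_Gamma) <= eps]
    return max(feasible) if feasible else None

eps, V_Gamma, CpV = 0.01, 0.15, 0.05        # illustrative assumptions
r_grid = np.linspace(-2.0, 1.0, 301)
print(f"a.s. SOCR       : {np.sqrt(V_Gamma) * norm.ppf(eps):.4f}")
print(f"m.v. SOCR bound : {r_mv_lower(eps, V_Gamma, CpV, r_grid):.4f}")
```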

Theorem 1

Fix an arbitrary $\epsilon\in(0,1)$. Then for any $\Gamma\in(\Gamma_0,\Gamma^*)$, $V>0$ and a DMC $W$ such that $|\Pi^*_{W,\Gamma}|=1$ and $V(\Gamma)>0$, we have $r_{\text{m.v.}}(\epsilon,\Gamma,V)>r_{\text{a.s.}}(\epsilon,\Gamma)$.

Proof: The proof is given in Section IV.

The improvement in Theorem 1 is illustrated in Figs. 1 and 2, which plot the second-order coding rate as a function of the average error probability for a binary symmetric channel with parameter $p=0.3$, alphabets $\mathcal{A}=\mathcal{B}=\{0,1\}$, cost threshold $\Gamma=0.2$ and cost function $c(x)=x$.

Figure 1: The SOCR is compared between the almost-sure cost constraint and the mean and variance cost constraints for different values of $V$. The plots for the mean and variance cost constraints are lower bounds to the SOCR since they are obtained through a non-exhaustive search of the feasible region in the maximization and minimization in (11) and (12), respectively.

As discussed in (6), the choice of $V$ together with the desired values of $\alpha$ and $\beta$ specifies the critical blocklength, exceeding which guarantees the $(\alpha,\beta)$-ergodicity of the coding scheme. In practice, the choice of blocklength is more fundamental as it affects complexity and latency. Therefore, it is more prudent for the value of $V$ to be dictated by the blocklength $n$ and the desired $(\alpha,\beta)$-ergodicity via the relation $V=n\beta\alpha^2\Gamma^2$ derived from (6). With the same channel ($p=0.3$) and cost ($\Gamma=0.2$) parameters as used in Fig. 1, Fig. 2 shows the second-order performance for different critical blocklengths for an $(\alpha,\beta)$-ergodic codebook with $\alpha=\beta=0.1$.

Figure 2: The SOCR is compared between the almost-sure cost constraint and the mean and variance cost constraints for an $(\alpha,\beta)$-ergodic random codebook with $\alpha=\beta=0.1$.

III-B Performance improvement for feedback codes

Definition 8

A controller is a function $F:(\mathcal{A}\times\mathcal{B})^{*}\to\mathcal{P}(\mathcal{A})$.

Random feedback codes can be constructed by controllers. Given a message $m\in\mathcal{M}_R$ and the past channel inputs and outputs $(x^{k-1},y^{k-1})$, the channel input $X_k=f(m,x^{k-1},y^{k-1})$ at time instant $k$ is distributed according to $F(x^{k-1},y^{k-1})$. There is a one-to-one correspondence between a random feedback code and a controller-based code.

  • A random feedback code $(f,g)$ is equivalently specified by the joint distribution

    p_{M,X^{n},Y^{n},\hat{M}} = p_{M}\left(\prod_{i=1}^{n}p_{X_{i}|M,X^{i-1},Y^{i-1}}\,p_{Y_{i}|X_{i}}\right)p_{\hat{M}|Y^{n}},

    where $\hat{M}$ is the decoded message. Marginalizing out $M$ and $\hat{M}$, we obtain

    p_{X^{n},Y^{n}} = \prod_{i=1}^{n}p_{X_{i}|X^{i-1},Y^{i-1}}\,p_{Y_{i}|X_{i}}

    from which a controller $F$ can be obtained with the specification

    F(x_{k}|x^{k-1},y^{k-1}) = p_{X_{k}|X^{k-1},Y^{k-1}}(x_{k}|x^{k-1},y^{k-1})

    for each time $k$.

  • Likewise, a controller $F$ specifies a random feedback code by inducing the following joint distribution:

    p_{M,X^{n},Y^{n},\hat{M}}(m,x^{n},y^{n},\hat{m}) = p_{M}(m)\left(\prod_{i=1}^{n}F(x_{i}|x^{i-1},y^{i-1})\,p_{Y_{i}|X_{i}}(y_{i}|x_{i})\right)p_{\hat{M}|Y^{n}}(\hat{m}|y^{n}).

A controller $F$ along with the channel $W$ specifies a joint distribution over $\mathcal{A}^n\times\mathcal{B}^n$ with probability assignments given by

(F\circ W)(x^{n},y^{n}) = \prod_{k=1}^{n}F(x_{k}|x^{k-1},y^{k-1})\,W(y_{k}|x_{k}). \qquad (13)
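A minimal sketch of (13) (our own illustration; the repeat-last-output controller is an arbitrary toy policy, not one from the paper) computes the controller-induced joint probability and verifies that it defines a probability distribution on $\mathcal{A}^n\times\mathcal{B}^n$:

```python
import itertools

def joint_prob(F, W, xs, ys):
    """(F o W)(x^n, y^n) per (13); F maps (x^{k-1}, y^{k-1}) to a pmf on A."""
    p = 1.0
    for k in range(len(xs)):
        p *= F(xs[:k], ys[:k])[xs[k]] * W[xs[k]][ys[k]]
    return p

W = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}  # a BSC(0.1) as the DMC

def F(x_hist, y_hist):
    """Toy feedback controller: deterministically repeat the last output."""
    a = y_hist[-1] if y_hist else 0
    return {a: 1.0, 1 - a: 0.0}

n = 3
total = sum(joint_prob(F, W, xs, ys)
            for xs in itertools.product([0, 1], repeat=n)
            for ys in itertools.product([0, 1], repeat=n))
print(f"total probability = {total:.6f}")  # sums to 1
```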
Lemma 1

Consider a channel $W$ with cost constraint $\Gamma\in(\Gamma_0,\Gamma^*)$. Then for every $n$, $\rho>0$ and $\epsilon\in(0,1)$,

\log M^{*}_{\text{fb}}(n,\epsilon,\Gamma) \leq \log\rho - \log\left[\left(1-\epsilon-\sup_{F}\,\inf_{q\in\mathcal{P}(\mathcal{B}^{n})}(F\circ W)\left(\frac{W(Y^{n}|X^{n})}{q(Y^{n})}>\rho\right)\right)^{+}\right], \qquad (14)

where the supremum in (14) is over all controllers $F$ satisfying

(F\circ W)\left(c(X^{n})\leq\Gamma\right) = \sum_{x^{n}:\,c(x^{n})\leq\Gamma}\;\sum_{y^{n}\in\mathcal{B}^{n}}\;\prod_{k=1}^{n}F(x_{k}|x^{k-1},y^{k-1})\,W(y_{k}|x_{k}) = 1. \qquad (15)

Proof: Lemma 1 is similar to [23, Theorem 27], [18, (42)], [4, Lemma 15], [8, Lemma 2] and others. Its proof is omitted.

Theorem 2

Fix an arbitrary $\epsilon\in(0,1)$. Then for any $\Gamma\in(\Gamma_0,\Gamma^*)$ and a simple-dispersion DMC $W$ such that $V(\Gamma)>0$, we have $r_{\text{a.s.,fb}}(\epsilon,\Gamma)=r_{\text{a.s.}}(\epsilon,\Gamma)$.

Proof: The proof is given in Section V.

We prove only the converse part, namely the following upper bound:

r_{\text{a.s.,fb}}(\epsilon,\Gamma) \leq \sqrt{V(\Gamma)}\,\Phi^{-1}(\epsilon). \qquad (16)

The result of Theorem 2 follows by combining (16) with the existing achievability result (without feedback) from [3, Theorem 3].

From Theorems 1 and 2 alone, we have that $r_{\text{m.v.,fb}}(\epsilon,\Gamma,V)>r_{\text{a.s.,fb}}(\epsilon,\Gamma)$. More importantly, the mean and variance cost formulation admits feedback mechanisms that improve the SOCR, even if the capacity-cost-achieving distribution is unique, i.e., $|\Pi^*_{W,\Gamma}|=1$. This is the more interesting case since for compound-dispersion channels where $|\Pi^*_{W,\Gamma}|>1$, feedback is already known to improve second-order performance via timid/bold coding [4].

In contrast to the almost-sure constraint, where $r_{\text{a.s.,fb}}(\epsilon,\Gamma)=r_{\text{a.s.}}(\epsilon,\Gamma)$, we observe that $r_{\text{m.v.,fb}}(\epsilon,\Gamma,V)>r_{\text{m.v.}}(\epsilon,\Gamma,V)$ for simple-dispersion channels with $|\Pi^*_{W,\Gamma}|=1$ [8, Theorem 3]. In summary, for any $\epsilon\in(0,1)$, $\Gamma\in(\Gamma_0,\Gamma^*)$, $V>0$ and a simple-dispersion DMC $W$ such that $|\Pi^*_{W,\Gamma}|=1$ and $V(\Gamma)>0$,

r_{\text{m.v.,fb}}(\epsilon,\Gamma,V) > r_{\text{m.v.}}(\epsilon,\Gamma,V) > r_{\text{a.s.,fb}}(\epsilon,\Gamma) = r_{\text{a.s.}}(\epsilon,\Gamma),

where the last equality above has been proven without the assumption $|\Pi^*_{W,\Gamma}|=1$.

IV Proof of Theorem 1

Since $\mathcal{K}(r,V)$ is a continuous function [8, Lemma 3], it suffices to show that for all $r\in\mathbb{R}$ and $V>0$,

\min_{\substack{\Pi:\; \mathbb{E}[\Pi]=r \\ \text{Var}(\Pi)\leq V \\ |\text{supp}(\Pi)|\leq 2}} \mathbb{E}\left[\Phi(\Pi)\right] < \Phi(r). \qquad (17)

The LHS of (17) can be written as

\min_{\substack{p,\pi:\; 0\leq p\leq 1 \\ \frac{p}{1-p}(\pi-r)^{2}\leq V}} \left[p\,\Phi(\pi)+(1-p)\,\Phi\left(\frac{r-p\pi}{1-p}\right)\right],

where we used the constraint $\mathbb{E}[\Pi]=r$ to eliminate one of the decision variables.

Suppose, for the sake of contradiction, that

p\,\Phi(\pi)+(1-p)\,\Phi\left(\frac{r-p\pi}{1-p}\right) \geq \Phi(r) \qquad (18)

for all $\pi\geq r$, $p\in[0,1]$ and $\frac{p}{1-p}(\pi-r)^2\leq V$. The assumption $\pi\geq r$ is without loss of generality since one of the two point masses must be greater than or equal to $r$.

If (18) holds, then

p\,\Phi(\pi)+(1-p)\,\Phi\left(\frac{r-p\pi}{1-p}\right) \geq \Phi(r)

for all $\pi\geq r$, $p\in[0,1]$ and $\frac{p}{1-p}(\pi-r)^2=V$. Since $\pi=r+\sqrt{\frac{V(1-p)}{p}}$ in this case, we must have

p\,\Phi\left(r+\sqrt{\frac{V(1-p)}{p}}\right)+(1-p)\,\Phi\left(\frac{r}{1-p}-\frac{p}{1-p}\left(r+\sqrt{\frac{V(1-p)}{p}}\right)\right) \geq \Phi(r)
p\,\Phi\left(r+\sqrt{\frac{V(1-p)}{p}}\right)+(1-p)\,\Phi\left(r-\frac{p}{1-p}\sqrt{\frac{V(1-p)}{p}}\right) \geq \Phi(r)
p\,\Phi\left(r+\sqrt{\frac{V(1-p)}{p}}\right)+(1-p)\,\Phi\left(r-\sqrt{\frac{Vp}{1-p}}\right) \geq \Phi(r) \qquad (19)

for all $p\in[0,1]$.

Consider the function

f(p) = p\,\Phi\left(r+\sqrt{\frac{V(1-p)}{p}}\right)+(1-p)\,\Phi\left(r-\sqrt{\frac{Vp}{1-p}}\right)-\Phi(r)

with domain $p\in[0,1]$. For any $p\in(0,1)$,

\frac{f(p)-f(0)}{p} = \Phi\left(r+\sqrt{\frac{V(1-p)}{p}}\right)+\frac{1-p}{p}\,\Phi\left(r-\sqrt{\frac{Vp}{1-p}}\right)-\frac{\Phi(r)}{p}
\leq \Phi\left(r+\sqrt{\frac{V(1-p)}{p}}\right)+\frac{1}{p}\,\Phi\left(r-\sqrt{\frac{Vp}{1-p}}\right)-\frac{\Phi(r)}{p}
\stackrel{(a)}{=} \Phi\left(r+\sqrt{\frac{V(1-p)}{p}}\right)+\frac{1}{p}\left[\Phi(r)-\phi(\tilde{r})\sqrt{\frac{Vp}{1-p}}-\Phi(r)\right]
= \Phi\left(r+\sqrt{\frac{V(1-p)}{p}}\right)-\phi(\tilde{r})\sqrt{\frac{V}{p(1-p)}}. \qquad (20)

In equality $(a)$, we have $r-\sqrt{\frac{Vp}{1-p}}<\tilde{r}<r$ by the mean value theorem. For sufficiently small $p>0$, the expression in (20) is negative: the first term is at most $1$, while $\sqrt{V/(p(1-p))}\to\infty$ and $\phi(\tilde{r})\to\phi(r)>0$ as $p\to 0$. Since $f(0)=0$, we have $f(p)<0$ for some $p>0$, which contradicts (19).
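The contradiction can also be checked numerically. The sketch below (assuming SciPy; the values of $r$ and $V$ are arbitrary, since the argument holds for all $r\in\mathbb{R}$ and $V>0$) evaluates $f(p)$ and shows it dipping below zero for small $p$:

```python
import numpy as np
from scipy.stats import norm

def f(p, r, V):
    """f(p) from the proof of Theorem 1."""
    hi = r + np.sqrt(V * (1 - p) / p)
    lo = r - np.sqrt(V * p / (1 - p))
    return p * norm.cdf(hi) + (1 - p) * norm.cdf(lo) - norm.cdf(r)

r, V = 0.0, 1.0
for p in (1e-1, 1e-2, 1e-3, 1e-4):
    print(f"p = {p:.0e}:  f(p) = {f(p, r, V):+.6f}")  # negative for small p
```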

V Proof of Theorem 2

For any $t\in\mathcal{P}_n(\mathcal{A})$, define

d_{W}(t) := \inf_{P\in\Pi_{W,\Gamma}^{*}}\|t-P\|_{1}.

For any $0<\gamma\leq\frac{V(\Gamma)}{4|\mathcal{A}|\nu_{\max}}$, define

\mathcal{P}_{n}^{\gamma} = \left\{x^{n}\in\mathcal{A}^{n} : c(x^{n})\leq\Gamma \text{ and } d_{W}(t(x^{n}))>\gamma\right\}
\mathcal{P}_{n}^{\gamma,c} = \left\{x^{n}\in\mathcal{A}^{n} : c(x^{n})\leq\Gamma \text{ and } d_{W}(t(x^{n}))\leq\gamma\right\}. \qquad (21)
Definition 9

For any distribution $P\in\mathcal{P}(\mathcal{A})$ and $S\subset\mathcal{A}$ such that $P(S)>0$, define the probability measure

P|_{S}(x) = \begin{cases}\frac{P(x)}{P(S)} & x\in S\\ 0 & \text{otherwise}.\end{cases}
Definition 10

For any $k\geq 0$ and any $x^k\in\mathcal{A}^k$, let (here, for $k\geq n$, $\mathcal{A}_{x^k}=\emptyset$, and for $k=0$, $(x^k,x)=(x)$)

\mathcal{A}_{x^{k}} = \left\{x\in\mathcal{A} : (x^{k},x)\text{ is a prefix of some }x^{n}\in\mathcal{P}_{n}^{\gamma,c}\right\}.

Fix $(a_0,a_1,\ldots,a_{n-1})\in\mathcal{P}_n^{\gamma,c}$ arbitrarily and let $\mathds{1}_{a_i}\in\mathcal{P}(\mathcal{A})$ denote the point-mass distribution at $a_i$. Then for any $0\leq k\leq n-1$, $x^k\in\mathcal{A}^k$, $y^k\in\mathcal{B}^k$ and a controller $F$ satisfying (15), we define the controller $F_\gamma$ as

F_{\gamma}(x^{k},y^{k}) := \begin{cases}F(x^{k},y^{k})|_{\mathcal{A}_{x^{k}}} & \text{if }F(\mathcal{A}_{x^{k}}|x^{k},y^{k})>0\\ \text{Unif}(\mathcal{A}_{x^{k}}) & \text{if }F(\mathcal{A}_{x^{k}}|x^{k},y^{k})=0\text{ and }|\mathcal{A}_{x^{k}}|\neq 0\\ \mathds{1}_{a_{k}} & \text{otherwise}.\end{cases}
Remark 1

Given any controller $F$ satisfying (15), Definition 10 constructs a modified controller $F_\gamma$ which satisfies

(F_{\gamma}\circ W)\left(\mathcal{P}_{n}^{\gamma,c}\right) := \sum_{x^{n}\in\mathcal{P}_{n}^{\gamma,c}}\;\sum_{y^{n}\in\mathcal{B}^{n}}\;\prod_{k=1}^{n}F_{\gamma}(x_{k}|x^{k-1},y^{k-1})\,W(y_{k}|x_{k}) = 1. \qquad (22)

Intuitively, $F_\gamma$ amplifies the probability assignments of $F$ over the set $\mathcal{P}_n^{\gamma,c}$ and nullifies the probability assignments of $F$ over the set $\mathcal{P}_n^{\gamma}$, so that $X^n\in\mathcal{P}_n^{\gamma,c}$ almost surely for $(X^n,Y^n)\sim(F_\gamma\circ W)$. Definition 10 is inspired by, and corrects an error in, [4, Def. 8]. With the definition given in [4, Def. 8], the analogue of (22) does not hold, although it is asserted in the proof of [4, Thm. 3]. This can be rectified by using Definition 10 in place of [4, Def. 8]. An analogous comment applies to the next definition and [4, Def. 9].

Definition 11

For any type $t\in\mathcal{P}_n(\mathcal{A})$ such that $T^n_{\mathcal{A}}(t)\subset\mathcal{P}_n^{\gamma}$, $k\geq 0$ and any $x^k\in\mathcal{A}^k$, let

\mathcal{A}_{x^{k}}^{t} = \left\{x\in\mathcal{A} : (x^{k},x)\text{ is a prefix of some }x^{n}\in T^{n}_{\mathcal{A}}(t)\right\}.

Fix $(a_0,a_1,\ldots,a_{n-1})\in T^n_{\mathcal{A}}(t)$ arbitrarily and let $\mathds{1}_{a_i}\in\mathcal{P}(\mathcal{A})$ denote the point-mass distribution at $a_i$. Then for any $0\leq k\leq n-1$, $x^k\in\mathcal{A}^k$, $y^k\in\mathcal{B}^k$ and a controller $F$ satisfying (15), we define the controller $F_t$ as

F_{t}(x^{k},y^{k}) := \begin{cases}F(x^{k},y^{k})|_{\mathcal{A}_{x^{k}}^{t}} & \text{if }F(\mathcal{A}^{t}_{x^{k}}|x^{k},y^{k})>0\\ \text{Unif}(\mathcal{A}^{t}_{x^{k}}) & \text{if }F(\mathcal{A}^{t}_{x^{k}}|x^{k},y^{k})=0\text{ and }|\mathcal{A}^{t}_{x^{k}}|\neq 0\\ \mathds{1}_{a_{k}} & \text{otherwise}.\end{cases}
Remark 2

Given any controller $F$ satisfying (15), Definition 11 constructs a modified controller $F_t$ which satisfies $(F_t\circ W)\left(T^n_{\mathcal{A}}(t)\right)=1$ for $T^n_{\mathcal{A}}(t)\subset\mathcal{P}_n^{\gamma}$.

Now let

q(y^{n}) = \frac{1}{2}\prod_{i=1}^{n}Q^{*}(y_{i})+\frac{1}{2}\,\frac{1}{|\mathcal{P}_{n}(\mathcal{A})|}\sum_{t\in\mathcal{P}_{n}(\mathcal{A})}\prod_{i=1}^{n}q_{t}(y_{i}), \qquad (23)

where

q_{t}(b) := \sum_{a\in\mathcal{A}}t(a)\,W(b|a).
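A small sketch of the mixture (23) on a toy channel (our own illustration; the helper enumerating $n$-types is an assumption of the sketch, and the uniform $Q^*$ is the capacity-achieving output law of a BSC):

```python
import itertools
import numpy as np

def all_n_types(alphabet, n):
    """All n-types on the alphabet, as dicts a -> t(a)."""
    return [{a: c / n for a, c in zip(alphabet, counts)}
            for counts in itertools.product(range(n + 1), repeat=len(alphabet))
            if sum(counts) == n]

def q_mixture(yn, Qstar, W, n_types):
    """q(y^n) of (23): half the iid Q* law, half the average of iid q_t laws."""
    iid_Q = np.prod([Qstar[y] for y in yn])
    q_t = lambda t, b: sum(t[a] * W[a][b] for a in t)
    avg_t = np.mean([np.prod([q_t(t, y) for y in yn]) for t in n_types])
    return 0.5 * iid_Q + 0.5 * avg_t

W = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}
Qstar = {0: 0.5, 1: 0.5}
types = all_n_types([0, 1], n=3)
print(q_mixture((0, 1, 1), Qstar, W, types))
```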

Let $P$ denote the distribution $F\circ W$. Let $P_\gamma$ denote the distribution $F_\gamma\circ W$. Let $P_t$ denote the distribution $F_t\circ W$ for each $t\in\mathcal{P}_n(\mathcal{A})$ such that $T^n_{\mathcal{A}}(t)\subset\mathcal{P}_n^{\gamma}$. Note that all the controllers $F$, $F_\gamma$ and $F_t$ satisfy (15). We have

P\left(\frac{W(Y^{n}|X^{n})}{q(Y^{n})}>\rho\right)
= P\left(\frac{W(Y^{n}|X^{n})}{q(Y^{n})}>\rho \,\cap\, d_{W}(t(X^{n}))\leq\gamma\right)+P\left(\frac{W(Y^{n}|X^{n})}{q(Y^{n})}>\rho \,\cap\, d_{W}(t(X^{n}))>\gamma\right)
= P\left(\frac{W(Y^{n}|X^{n})}{q(Y^{n})}>\rho \,\cap\, d_{W}(t(X^{n}))\leq\gamma\right)+\sum_{t:T^{n}_{\mathcal{A}}(t)\subset\mathcal{P}_{n}^{\gamma}}P\left(\frac{W(Y^{n}|X^{n})}{q(Y^{n})}>\rho \,\cap\, t(X^{n})=t\right)
\stackrel{(a)}{\leq} P_{\gamma}\left(\frac{W(Y^{n}|X^{n})}{q(Y^{n})}>\rho\right)+\sum_{t:T^{n}_{\mathcal{A}}(t)\subset\mathcal{P}_{n}^{\gamma}}P_{t}\left(\frac{W(Y^{n}|X^{n})}{q(Y^{n})}>\rho\right). \qquad (24)

Inequality $(a)$ follows from the following argument. For any $(x^n,y^n)$ such that $x^n\in\mathcal{P}_n^{\gamma,c}$, note that for all $1\leq k\leq n$, $\mathcal{A}_{x^{k-1}}\neq\emptyset$ and $x_k\in\mathcal{A}_{x^{k-1}}$, so that

F(x_{k}|x^{k-1},y^{k-1}) \leq F(\mathcal{A}_{x^{k-1}}|x^{k-1},y^{k-1}). \qquad (25)

Then

P_{\gamma}\left((x^{n},y^{n})\right) = \prod_{k=1}^{n}F_{\gamma}(x_{k}|x^{k-1},y^{k-1})\,W(y_{k}|x_{k})
\stackrel{(b)}{\geq} \prod_{k=1}^{n}\frac{F(x_{k}|x^{k-1},y^{k-1})}{F(\mathcal{A}_{x^{k-1}}|x^{k-1},y^{k-1})}\,W(y_{k}|x_{k})
\geq \prod_{k=1}^{n}F(x_{k}|x^{k-1},y^{k-1})\,W(y_{k}|x_{k})
= P((x^{n},y^{n})).

With some abuse of notation, we assume in inequality $(b)$ above that if $F(\mathcal{A}_{x^{k-1}}|x^{k-1},y^{k-1})=0$, then

\frac{F(x_{k}|x^{k-1},y^{k-1})}{F(\mathcal{A}_{x^{k-1}}|x^{k-1},y^{k-1})} = 0,

which is justified by (25).

A similar derivation gives

P_{t}((x^{n},y^{n})) \geq P((x^{n},y^{n}))

for all $(x^n,y^n)$ such that $c(x^n)\leq\Gamma$ and $t(x^n)=t$, where $T^n_{\mathcal{A}}(t)\subset\mathcal{P}_n^{\gamma}$.

Let $\rho=\exp\left(nC(\Gamma)+\sqrt{n}r\right)$, where $r$ will be specified later. Define (this proof follows that of [4, Thm. 3] and corrects an error therein: the $\mathcal{F}_i$ filtration defined before (98) there should be defined like $\mathcal{G}_i$ here)

\mathcal{G}_{i} := \sigma(X_{1},\ldots,X_{i+1},Y_{1},\ldots,Y_{i})
Z_{i} := i(X_{i},Y_{i})-\mathbb{E}\left[i(X_{i},Y_{i})|\mathcal{G}_{i-1}\right]
\mathcal{F}_{i} := \sigma(Z_{1},Z_{2},\ldots,Z_{i}).

Two things are important to note here. First, by the Markov property $(X^{i-1},Y^{i-1})-X_i-Y_i$, we have $\mathbb{E}\left[i(X_i,Y_i)|\mathcal{G}_{i-1}\right]=\mathbb{E}\left[i(X_i,Y_i)|X_i\right]$ a.s. Second, $\mathcal{F}_i\subset\mathcal{G}_i$.

We can upper bound the first term in (24) as follows:

P_{\gamma}\left(\frac{W(Y^{n}|X^{n})}{q(Y^{n})}>\rho\right)
\leq P_{\gamma}\left(\prod_{i=1}^{n}\frac{W(Y_{i}|X_{i})}{Q^{*}(Y_{i})}>\frac{\rho}{2}\right)
= P_{\gamma}\left(\sum_{i=1}^{n}\left[\log\left(\frac{W(Y_{i}|X_{i})}{Q^{*}(Y_{i})}\right)-C(\Gamma)\right]>\sqrt{n}r-\log 2\right)
\stackrel{(a)}{\leq} P_{\gamma}\left(\sum_{i=1}^{n}\left[i(X_{i},Y_{i})-\mathbb{E}\left[i(X_{i},Y_{i})|\mathcal{G}_{i-1}\right]\right]>\sqrt{n}r-\log 2\right)
= P_{\gamma}\left(\sum_{i=1}^{n}Z_{i}>\sqrt{n}r-\log 2\right). \qquad (26)

In inequality $(a)$, we used the following lemma and the fact that $c(X^n)\leq\Gamma$ almost surely.

Lemma 2

For $\Gamma\in(\Gamma_0,\Gamma^*)$,

\mathbb{E}\left[i(X,Y)|X\right] \leq C(\Gamma)-C'(\Gamma)\left(\Gamma-c(X)\right) \qquad (27)

almost surely, where $X$ has an arbitrary distribution and $Y$ is the output of the channel $W$ when $X$ is the input.

Proof: See [24, Proposition 1] and its references.

We will now apply a martingale central limit theorem [25, Corollary to Theorem 2] to the expression in (26). We first verify that the hypotheses of [25, Corollary to Theorem 2] are satisfied:

  1. First, we require that

    \max_{1\leq k\leq n}|Z_{k}| < \infty.

    Since $Q^*(b)>0$ for all $b\in\mathcal{B}$ by assumption and $W(Y_k|X_k)>0$ almost surely for each channel input and output pair $(X_k,Y_k)$, we have

    |Z_{k}| \leq \max_{a\in\mathcal{A},\,b\in\mathcal{B}:\,W(b|a)>0}2\,|i(a,b)| =: 2i_{\max} < \infty

    for all $1\leq k\leq n$.

  2. Second, we require that

    \mathbb{E}\left[Z_{k}|\mathcal{F}_{k-1}\right] = 0

    almost surely for all $1\leq k\leq n$ [25, p. 672]. This is true because $\mathbb{E}\left[Z_k|\mathcal{G}_{k-1}\right]=0$ implies

    \mathbb{E}\left[\mathbb{E}\left[Z_{k}|\mathcal{G}_{k-1}\right]|\mathcal{F}_{k-1}\right] = 0, \quad\text{i.e.,}\quad \mathbb{E}\left[Z_{k}|\mathcal{F}_{k-1}\right] = 0.

Under the above two conditions, it follows from [25, Corollary to Theorem 2] that there exists a constant $\kappa>0$ depending only on $i_{\max}$ such that for any $s\in\mathbb{R}$,

P_{\gamma}\left(\frac{1}{\sqrt{\sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[Z_{i}^{2}\right]}}\sum_{i=1}^{n}Z_{i}\leq s\right) \geq \Phi(s)-\kappa\left[\frac{n\log n}{\left(\sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[Z_{i}^{2}\right]\right)^{\frac{3}{2}}}+\left\|\frac{\sum_{i=1}^{n}\mathbb{E}_{\gamma}[Z_{i}^{2}|\mathcal{F}_{i-1}]}{\sum_{i=1}^{n}\mathbb{E}_{\gamma}[Z_{i}^{2}]}-1\right\|_{\infty}^{1/2}\right]. \qquad (28)

Using Lemma 3 in (28), we obtain

P_{\gamma}\left(\frac{1}{\sqrt{\sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[Z_{i}^{2}\right]}}\sum_{i=1}^{n}Z_{i}\leq s\right) \geq \Phi(s)-\kappa\left[\frac{n\log n}{\left(nV(\Gamma)-n|\mathcal{A}|\gamma\nu_{\max}\right)^{\frac{3}{2}}}+\left(\frac{4|\mathcal{A}|\gamma\nu_{\max}}{V(\Gamma)}\right)^{1/2}\right] \geq \Phi(s)-\beta_{\gamma}, \qquad (29)

where the last inequality holds for sufficiently large $n$ for some constant $\beta_\gamma>0$ which can be chosen such that $\beta_\gamma\to 0$ as $\gamma\to 0$.

Lemma 3

We have

V(\Gamma)-2\gamma\nu_{\max} \leq \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[Z_{i}^{2}\right] \leq V(\Gamma)+2\gamma\nu_{\max}.

Furthermore, for $\gamma\leq\frac{V(\Gamma)}{4\nu_{\max}}$,

\left\|\frac{\sum_{i=1}^{n}\mathbb{E}_{\gamma}[Z_{i}^{2}|\mathcal{F}_{i-1}]}{\sum_{i=1}^{n}\mathbb{E}_{\gamma}[Z_{i}^{2}]}-1\right\|_{\infty} \leq \frac{8\gamma\nu_{\max}}{V(\Gamma)}

almost surely according to the probability measure $P_\gamma$.

Proof: The proof of Lemma 3 is given in Appendix A.

Using the result in (29) and Lemma 3 in the expression in (26), we obtain

P_{\gamma}\left(\sum_{i=1}^{n}Z_{i}>\sqrt{n}r-\log 2\right)
= P_{\gamma}\left(\frac{1}{\sqrt{\sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[Z_{i}^{2}\right]}}\sum_{i=1}^{n}Z_{i}>\frac{\sqrt{n}r-\log 2}{\sqrt{\sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[Z_{i}^{2}\right]}}\right)
= 1-P_{\gamma}\left(\frac{1}{\sqrt{\sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[Z_{i}^{2}\right]}}\sum_{i=1}^{n}Z_{i}\leq\frac{\sqrt{n}r-\log 2}{\sqrt{\sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[Z_{i}^{2}\right]}}\right)
\leq \begin{cases}1-P_{\gamma}\left(\frac{1}{\sqrt{\sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[Z_{i}^{2}\right]}}\sum_{i=1}^{n}Z_{i}\leq\frac{r-\frac{\log 2}{\sqrt{n}}}{\sqrt{V(\Gamma)+|\mathcal{A}|\gamma\nu_{\max}}}\right) & \text{if }r\geq\frac{\log 2}{\sqrt{n}}\\ 1-P_{\gamma}\left(\frac{1}{\sqrt{\sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[Z_{i}^{2}\right]}}\sum_{i=1}^{n}Z_{i}\leq\frac{r-\frac{\log 2}{\sqrt{n}}}{\sqrt{V(\Gamma)-|\mathcal{A}|\gamma\nu_{\max}}}\right) & \text{if }r<\frac{\log 2}{\sqrt{n}}\end{cases}
\leq \begin{cases}1-\Phi\left(\frac{r-\frac{\log 2}{\sqrt{n}}}{\sqrt{V(\Gamma)+|\mathcal{A}|\gamma\nu_{\max}}}\right)+\beta_{\gamma} & \text{if }r\geq\frac{\log 2}{\sqrt{n}}\\ 1-\Phi\left(\frac{r-\frac{\log 2}{\sqrt{n}}}{\sqrt{V(\Gamma)-|\mathcal{A}|\gamma\nu_{\max}}}\right)+\beta_{\gamma} & \text{if }r<\frac{\log 2}{\sqrt{n}}.\end{cases} \qquad (30)

Let

r = \begin{cases}\sqrt{V(\Gamma)+|\mathcal{A}|\gamma\nu_{\max}}\,\Phi^{-1}\left(\epsilon+3\beta_{\gamma}\right)+\frac{\log 2}{\sqrt{n}} & \text{if }\epsilon\in\left[\frac{1}{2}-3\beta_{\gamma},1\right)\\ \sqrt{V(\Gamma)-|\mathcal{A}|\gamma\nu_{\max}}\,\Phi^{-1}\left(\epsilon+3\beta_{\gamma}\right)+\frac{\log 2}{\sqrt{n}} & \text{if }\epsilon\in\left(0,\frac{1}{2}-3\beta_{\gamma}\right).\end{cases} \qquad (31)

Note that the upper bound in (30) holds for any $r$. For a given error probability $\epsilon\in(0,1)$, we choose $r$ according to (31). Then using (31) in (30), we obtain for any given $\epsilon\in(0,1)$ that

P_{\gamma}\left(\sum_{i=1}^{n}Z_{i}>\sqrt{n}r-\log 2\right) \leq 1-\epsilon-2\beta_{\gamma}. \qquad (32)

Inequality (32) provides an upper bound on the first term in (24).

We now upper bound the second term in (24).

Using again the choice of $q$ in (23), we have

\sum_{t:T^{n}_{\mathcal{A}}(t)\subset\mathcal{P}_{n}^{\gamma}}P_{t}\left(\frac{W(Y^{n}|X^{n})}{q(Y^{n})}>\rho\right)
\leq \sum_{t:T^{n}_{\mathcal{A}}(t)\subset\mathcal{P}_{n}^{\gamma}}P_{t}\left(\frac{W(Y^{n}|X^{n})}{\prod_{i=1}^{n}q_{t}(Y_{i})}>\frac{\rho}{2|\mathcal{P}_{n}(\mathcal{A})|}\right)
\leq \sum_{t:T^{n}_{\mathcal{A}}(t)\subset\mathcal{P}_{n}^{\gamma}}P_{t}\left(\sum_{i=1}^{n}\log\frac{W(Y_{i}|X_{i})}{q_{t}(Y_{i})}>nC(\Gamma)+\sqrt{n}r-\log 2(n+1)^{|\mathcal{A}|}\right)
\stackrel{(a)}{=} \sum_{t:T^{n}_{\mathcal{A}}(t)\subset\mathcal{P}_{n}^{\gamma}}W\left(\sum_{i=1}^{n}\log\frac{W(Y_{i}|x_{t,i})}{q_{t}(Y_{i})}>nC(\Gamma)+\sqrt{n}r-\log 2(n+1)^{|\mathcal{A}|}\right), \qquad (33)

where in equality $(a)$, $(x_{t,1},\ldots,x_{t,n})$ is an arbitrary sequence from the type class $T^n_{\mathcal{A}}(t)$. Equality $(a)$ holds because under the probability measure $P_t$, $t(X^n)=t$ a.s. (see Remark 2) and the distribution of

\sum_{i=1}^{n}\log\frac{W(Y_{i}|X_{i})}{q_{t}(Y_{i})}

depends on $X^n$ only through its type. Continuing from (33), we have

\sum_{t:T^{n}_{\mathcal{A}}(t)\subset\mathcal{P}_{n}^{\gamma}}P_{t}\left(\frac{W(Y^{n}|X^{n})}{q(Y^{n})}>\rho\right)
\leq \sum_{t:T^{n}_{\mathcal{A}}(t)\subset\mathcal{P}_{n}^{\gamma}}W\left(\sum_{i=1}^{n}\left[\log\frac{W(Y_{i}|x_{t,i})}{q_{t}(Y_{i})}-\mathbb{E}_{W}\left[\log\frac{W(Y|x_{t,i})}{q_{t}(Y)}\right]\right]>n\left[C(\Gamma)-I(t,W)\right]+\sqrt{n}r-\log 2(n+1)^{|\mathcal{A}|}\right)
\leq \sum_{t:T^{n}_{\mathcal{A}}(t)\subset\mathcal{P}_{n}^{\gamma}}W\left(\sum_{i=1}^{n}\left[\log\frac{W(Y_{i}|x_{t,i})}{q_{t}(Y_{i})}-\mathbb{E}_{W}\left[\log\frac{W(Y|x_{t,i})}{q_{t}(Y)}\right]\right]>\frac{nK}{2}\right), \qquad (34)

where the last inequality holds for sufficiently large $n$ because $r$, as defined in (31), is an $O(1)$ term, and from the construction of the set $\mathcal{P}_n^{\gamma}$, we have

\inf_{t:T^{n}_{\mathcal{A}}(t)\subset\mathcal{P}_{n}^{\gamma}}d_{W}(t) \geq \gamma > 0,

which implies

\inf_{t:T^{n}_{\mathcal{A}}(t)\subset\mathcal{P}_{n}^{\gamma}}\left[C(\Gamma)-I(t,W)\right] > K

for some constant $K>0$.

Let $i_{\max,t}:=\max_{a,b:\,q_t(b)W(b|a)>0}\left|\log\frac{W(b|a)}{q_t(b)}\right|$. We now show that $i_{\max,t}\leq 2\log n$ for all $t$. Let $W_{\min}:=\min_{a,b:\,W(b|a)>0}W(b|a)$ and $q_{\min,t}:=\min_{b:\,q_t(b)>0}q_t(b)$. Then

q_{\min,t} = \min_{b:\,q_{t}(b)>0}\sum_{a\in\mathcal{A}}t(a)\,W(b|a) \geq \min_{a,b:\,W(b|a)>0}W(b|a)\,\min_{a:\,t(a)>0}t(a) \geq \frac{W_{\min}}{n},
where the final step uses $\min_{a:\,t(a)>0}t(a)\geq 1/n$ for an $n$-type $t$.

Thus,

i_{\max,t} = \max_{a,b:\,q_{t}(b)W(b|a)>0}\left|\log\frac{W(b|a)}{q_{t}(b)}\right| \leq \max_{a,b:\,q_{t}(b)W(b|a)>0}\left|\log W(b|a)\right|+\max_{b:\,q_{t}(b)>0}\left|\log q_{t}(b)\right| \leq \log\frac{n}{W_{\min}^{2}} \leq 2\log n

for all sufficiently large $n$. Hence, we can use Azuma's inequality [26, (33), p. 61] to upper bound (34), giving us

\sum_{t:T^{n}_{\mathcal{A}}(t)\subset\mathcal{P}_{n}^{\gamma}}P_{t}\left(\frac{W(Y^{n}|X^{n})}{q(Y^{n})}>\rho\right) \leq (n+1)^{|\mathcal{A}|}\exp\left(-\frac{nK^{2}}{128\log^{2}n}\right), \qquad (35)

which goes to zero as $n\to\infty$.
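For intuition on the "sufficiently large $n$" qualifier, the bound (35) can be evaluated in the log-domain (the values of $|\mathcal{A}|$ and $K$ below are illustrative assumptions): the polynomial factor $(n+1)^{|\mathcal{A}|}$ dominates at moderate blocklengths, after which the exponential term takes over.

```python
import numpy as np

A_size, K = 2, 1.0  # illustrative assumptions
for n in (10**5, 10**6, 10**7, 10**8):
    log_bound = A_size * np.log(n + 1) - n * K**2 / (128 * np.log(n) ** 2)
    print(f"n = {n:>9d}:  log of bound (35) = {log_bound:+.1f}")
```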

Substituting the upper bounds (32) and (35) in (24), we obtain

(F\circ W)\left(\frac{W(Y^{n}|X^{n})}{q(Y^{n})}>\exp\left(nC(\Gamma)+\sqrt{n}r\right)\right) \leq 1-\epsilon-\beta_{\gamma}

for sufficiently large $n$. Since the controller $F$ was arbitrary, we can apply Lemma 1 to obtain

\log M^{*}_{\text{fb}}(n,\epsilon,\Gamma) \leq nC(\Gamma)+\sqrt{n}r-\log\beta_{\gamma}
\frac{\log M^{*}_{\text{fb}}(n,\epsilon,\Gamma)-nC(\Gamma)}{\sqrt{n}} \leq r-\frac{\log\beta_{\gamma}}{\sqrt{n}}
\limsup_{n\to\infty}\frac{\log M^{*}_{\text{fb}}(n,\epsilon,\Gamma)-nC(\Gamma)}{\sqrt{n}} \leq r',

where $r'$ is obtained from the expression for $r$ in (31) after taking the limit as $n\to\infty$, i.e.,

r' = \begin{cases}\sqrt{V(\Gamma)+|\mathcal{A}|\gamma\nu_{\max}}\,\Phi^{-1}\left(\epsilon+3\beta_{\gamma}\right) & \text{if }\epsilon\in\left[\frac{1}{2}-3\beta_{\gamma},1\right)\\ \sqrt{V(\Gamma)-|\mathcal{A}|\gamma\nu_{\max}}\,\Phi^{-1}\left(\epsilon+3\beta_{\gamma}\right) & \text{if }\epsilon\in\left(0,\frac{1}{2}-3\beta_{\gamma}\right).\end{cases}

Finally, since $\frac{V(\Gamma)}{4|\mathcal{A}|\nu_{\max}}>\gamma>0$ was arbitrary, we can take $\gamma$ and $\beta_\gamma$ arbitrarily small, giving us the converse result

\limsup_{n\to\infty}\frac{\log M^{*}_{\text{fb}}(n,\epsilon,\Gamma)-nC(\Gamma)}{\sqrt{n}} \leq \sqrt{V(\Gamma)}\,\Phi^{-1}(\epsilon).

Since this matches the optimal non-feedback SOCR of simple-dispersion DMCs with a peak-power cost constraint, we have

\lim_{n\to\infty}\frac{\log M^{*}_{\text{fb}}(n,\epsilon,\Gamma)-nC(\Gamma)}{\sqrt{n}} = \sqrt{V(\Gamma)}\,\Phi^{-1}(\epsilon) \qquad (36)

for simple-dispersion DMCs with a peak-power cost constraint.

Appendix A Proof of Lemma 3

We have

\sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[Z_{i}^{2}|\mathcal{G}_{i-1}\right]
= \sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[\left(i(X_{i},Y_{i})-\mathbb{E}\left[i(X_{i},Y_{i})|X_{i}\right]\right)^{2}|\mathcal{G}_{i-1}\right]
= \sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[\left(i(X_{i},Y_{i})-\mathbb{E}\left[i(X_{i},Y_{i})|X_{i}\right]\right)^{2}|X_{i}\right]
= \sum_{i=1}^{n}\text{Var}\left(i(X_{i},Y_{i})|X_{i}\right)
= \sum_{i=1}^{n}\sum_{a\in\mathcal{A}}\mathds{1}\left(X_{i}=a\right)\nu_{a}
= n\sum_{a\in\mathcal{A}}P_{X^{n}}(a)\,\nu_{a}.

From Remark 1, since $d_W(t(X^n))\leq\gamma$ a.s., there exists a $\tilde{P}\in\Pi^*_{W,\Gamma}$ such that $\|t(X^n)-\tilde{P}\|_1\leq 2\gamma$. Hence,

nV(\Gamma)-2n\gamma\nu_{\max} \leq \sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[Z_{i}^{2}|\mathcal{G}_{i-1}\right] \leq nV(\Gamma)+2n\gamma\nu_{\max}

a.s., where we used the fact that $W$ is simple-dispersion at the cost $\Gamma$. Furthermore, since $\mathcal{F}_{i-1}\subset\mathcal{G}_{i-1}$,

nV(\Gamma)-2n\gamma\nu_{\max} \leq \sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[Z_{i}^{2}|\mathcal{F}_{i-1}\right] \leq nV(\Gamma)+2n\gamma\nu_{\max}

a.s. and

nV(\Gamma)-2n\gamma\nu_{\max} \leq \sum_{i=1}^{n}\mathbb{E}_{\gamma}\left[Z_{i}^{2}\right] \leq nV(\Gamma)+2n\gamma\nu_{\max}.

Finally, we have

\left|\frac{\sum_{i=1}^{n}\mathbb{E}_{\gamma}[Z_{i}^{2}|\mathcal{F}_{i-1}]}{\sum_{i=1}^{n}\mathbb{E}_{\gamma}[Z_{i}^{2}]}-1\right| \leq \left|\frac{V(\Gamma)+2\gamma\nu_{\max}}{V(\Gamma)-2\gamma\nu_{\max}}-1\right| = \frac{4\gamma\nu_{\max}}{V(\Gamma)-2\gamma\nu_{\max}} \leq \frac{8\gamma\nu_{\max}}{V(\Gamma)},

assuming $\gamma\leq\frac{V(\Gamma)}{4\nu_{\max}}$.

Acknowledgment

This research was supported by the US National Science Foundation under grant CCF-1956192.

References

  • [1] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems, 2nd ed.   Cambridge University Press, 2011.
  • [2] V. Strassen, “Asymptotische Abschätzungen in Shannons Informationstheorie,” in Trans. 3rd Prague Conf. Inf. Theory, Prague, Czechoslovakia, 1962, pp. 689–723.
  • [3] M. Hayashi, “Information spectrum approach to second-order coding rate in channel coding,” IEEE Transactions on Information Theory, vol. 55, no. 11, pp. 4947–4966, 2009.
  • [4] A. B. Wagner, N. V. Shende, and Y. Altuğ, “A new method for employing feedback to improve coding performance,” IEEE Transactions on Information Theory, vol. 66, no. 11, pp. 6660–6681, 2020.
  • [5] Y. Y. Shkel, V. Y. F. Tan, and S. C. Draper, “Second-order coding rate for $m$-class source-channel codes,” in 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2015, pp. 620–626.
  • [6] M. Tomamichel and V. Y. F. Tan, “Second-order coding rates for channels with state,” IEEE Transactions on Information Theory, vol. 60, no. 8, pp. 4427–4448, 2014.
  • [7] C. E. Shannon, “Probability of error for optimal codes in a Gaussian channel,” The Bell System Technical Journal, vol. 38, no. 3, pp. 611–656, 1959.
  • [8] A. Mahmood and A. B. Wagner, “Channel coding with mean and variance cost constraints,” in 2024 IEEE International Symposium on Information Theory (ISIT), 2024, pp. 510–515.
  • [9] R. Yates, “A framework for uplink power control in cellular radio systems,” IEEE Journal on Selected Areas in Communications, vol. 13, no. 7, pp. 1341–1347, 1995.
  • [10] A. Goldsmith and P. Varaiya, “Capacity of fading channels with channel side information,” IEEE Transactions on Information Theory, vol. 43, no. 6, pp. 1986–1992, 1997.
  • [11] S. Hanly and D. Tse, “Multiaccess fading channels. II. Delay-limited capacities,” IEEE Transactions on Information Theory, vol. 44, no. 7, pp. 2816–2831, 1998.
  • [12] A. Lozano and N. Jindal, “Transmit diversity vs. spatial multiplexing in modern MIMO systems,” IEEE Transactions on Wireless Communications, vol. 9, no. 1, pp. 186–197, 2010.
  • [13] Y. Polyanskiy, “Channel coding: Non-asymptotic fundamental limits,” Ph.D. dissertation, Dept. Elect. Eng., Princeton Univ., Princeton, NJ, USA, 2010.
  • [14] A. Sahai, S. Draper, and M. Gastpar, “Boosting reliability over AWGN networks with average power constraints and noiseless feedback,” in Proceedings. International Symposium on Information Theory, 2005. ISIT 2005., 2005, pp. 402–406.
  • [15] S. L. Fong and V. Y. F. Tan, “Asymptotic expansions for the AWGN channel with feedback under a peak power constraint,” in 2015 IEEE International Symposium on Information Theory (ISIT), 2015, pp. 311–315.
  • [16] A. Mahmood and A. B. Wagner, “Timid/bold coding for channels with cost constraints,” in 2023 IEEE International Symposium on Information Theory (ISIT), 2023, pp. 1442–1447.
  • [17] V. Kostina and S. Verdú, “Channels with cost constraints: Strong converse and dispersion,” IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2415–2429, 2015.
  • [18] S. L. Fong and V. Y. F. Tan, “A tight upper bound on the second-order coding rate of the parallel Gaussian channel with feedback,” IEEE Transactions on Information Theory, vol. 63, no. 10, pp. 6474–6486, 2017.
  • [19] W. Yang, G. Caire, G. Durisi, and Y. Polyanskiy, “Optimum power control at finite blocklength,” IEEE Transactions on Information Theory, vol. 61, no. 9, pp. 4598–4615, 2015.
  • [20] V. Y. F. Tan and M. Tomamichel, “The third-order term in the normal approximation for the AWGN channel,” IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2430–2438, 2015.
  • [21] R. G. Gallager, Information Theory and Reliable Communication.   New York, NY, USA: Wiley, 1968.
  • [22] C. Shannon, “The zero error capacity of a noisy channel,” IRE Transactions on Information Theory, vol. 2, no. 3, pp. 8–19, 1956.
  • [23] Y. Polyanskiy, H. V. Poor, and S. Verdú, “Channel coding rate in the finite blocklength regime,” IEEE Transactions on Information Theory, vol. 56, no. 5, pp. 2307–2359, 2010.
  • [24] A. Mahmood and A. B. Wagner, “Channel coding with mean and variance cost constraints,” 2024. [Online]. Available: https://arxiv.org/abs/2401.16417
  • [25] E. Bolthausen, “Exact convergence rates in some martingale central limit theorems,” The Annals of Probability, vol. 10, no. 3, pp. 672–688, Aug. 1982.
  • [26] B. Bercu, B. Delyon, and E. Rio, Concentration Inequalities for Sums and Martingales.   Cham, Switzerland: Springer, 2015.