
Optimal Rates of Teaching and
Learning Under Uncertainty

Yan Hao Ling and Jonathan Scarlett
Abstract

In this paper, we consider a recently-proposed model of teaching and learning under uncertainty, in which a teacher receives independent observations of a single bit corrupted by binary symmetric noise, and sequentially transmits to a student through another binary symmetric channel based on the bits observed so far. After a given number n of transmissions, the student outputs an estimate of the unknown bit, and we are interested in the exponential decay rate of the error probability as n increases. We propose a novel block-structured teaching strategy in which the teacher encodes the number of 1s received in each block, and show that the resulting error exponent is the binary relative entropy D\big(\frac{1}{2}\|\max(p,q)\big), where p and q are the noise parameters. This matches a trivial converse result based on the data processing inequality, and settles two conjectures of [Jog and Loh, 2021] and [Huleihel, Polyanskiy, and Shayevitz, 2019]. In addition, we show that the computation time required by the teacher and student is linear in n. We also study a more general setting in which the binary symmetric channels are replaced by general binary-input discrete memoryless channels. We provide an achievability bound and a converse bound, and show that the two coincide in certain cases, including (i) when the two channels are identical, and (ii) when the teacher-student channel is a binary symmetric channel. More generally, we give sufficient conditions under which our learning rate is the best possible for block-structured protocols.

The authors are with the Department of Computer Science and Department of Mathematics, School of Computing, National University of Singapore (NUS). Jonathan Scarlett is also with the Institute of Data Science, NUS. Emails: [email protected]; [email protected]. This work was presented in part at the 2021 IEEE International Symposium on Information Theory (ISIT).

I Introduction

In several societal and technological domains, one is interested in how agents interact with their environment and with each other to attain goals such as learning information about the environment, conveying this information to other agents, and reaching a common consensus. While a comprehensive theoretical understanding of such problems would likely require highly sophisticated mathematical models, even the simplest models come with unique insights and challenges.

In this paper, we study a recently-proposed model of teaching and learning under uncertainty [1, 2], in which a teacher observes noisy information regarding an unknown 1-bit quantity Θ\Theta (the environment), and seeks to convey information to a student via a noisy channel to facilitate learning Θ\Theta. We establish the optimal learning rate (i.e., exponential decay of the error probability) of this problem, thereby settling two notable conjectures made in [1, 2] described in detail below. In addition, we study a generalization of the problem to more general binary-input discrete memoryless channels, and obtain the optimal learning rate in certain special cases of interest.

I-A Model and Definitions

Figure 1: Illustration of the problem setup.

We first formalize the model, which is depicted in Figure 1. The unknown quantity of interest is a binary random variable \Theta that takes each value in \{0,1\} with probability \frac{1}{2}.

At each time step i\in\{1,\dotsc,n\}, the teacher observes \Theta through BSC(p), a binary symmetric channel with error rate p\in(0,1/2):

\mathbb{P}(Y_{i}=\Theta)=1-p,\qquad\mathbb{P}(Y_{i}=1-\Theta)=p. (1)

The binary symmetric channel is memoryless, meaning that the Y_{i}'s are conditionally independent given \Theta. In Section III, we will study more general and possibly non-symmetric binary-input discrete memoryless channels.

At time i, the teacher then transmits binary information \hat{X}_{i} to the student through another binary symmetric channel BSC(q), where q\in(0,1/2). Thus, denoting the student's observations by Z_{i}, we have

\mathbb{P}(\hat{X}_{i}=Z_{i})=1-q,\qquad\mathbb{P}(\hat{X}_{i}=1-Z_{i})=q, (2)

and the Z_{i} are conditionally independent given \hat{X}_{i}. Importantly, \hat{X}_{i} must only be a function of Y_{1},\ldots,Y_{i}; the teacher cannot look into the future.

At time n, having received Z_{1},\ldots,Z_{n}, the student makes an estimate of \Theta, which we denote by \hat{\Theta}_{n} (or sometimes simply \hat{\Theta}). Then, the learning rate is defined as

\mathcal{R}=\limsup_{n\rightarrow\infty}\left\{-\frac{1}{n}\ln\mathbb{P}(\hat{\Theta}_{n}\neq\Theta)\right\}. (3)

We assume that the noise parameters p and q are known to both the teacher and student; removing this assumption may be an interesting direction for future work.
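To make the setup concrete, the following is a minimal simulation sketch in Python (with hypothetical parameter values) of a single run of the model, using the simple-forwarding teacher and majority-rule student recalled in Section I-B below as placeholder strategies; it only illustrates the interfaces of the model, not the protocol proposed in this paper.

```python
import random

def bsc(bit, crossover):
    """Pass a bit through a binary symmetric channel with the given crossover probability."""
    return bit ^ (random.random() < crossover)

def run_once(n, p, q):
    """One run of the teaching/learning model with a simple-forwarding teacher
    and a majority-rule student (baseline strategies from Section I-B)."""
    theta = random.randint(0, 1)            # unknown bit, uniform on {0, 1}
    z = []
    for _ in range(n):
        y = bsc(theta, p)                   # teacher observes Theta through BSC(p)
        x_hat = y                           # simple forwarding: X_hat_i = Y_i
        z.append(bsc(x_hat, q))             # student observes X_hat_i through BSC(q)
    theta_hat = int(sum(z) > n / 2)         # majority rule
    return theta_hat == theta

# Hypothetical parameters, for illustration only.
n, p, q, trials = 200, 0.1, 0.1, 10_000
errors = sum(not run_once(n, p, q) for _ in range(trials))
print("empirical error probability:", errors / trials)
```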

I-B Existing Results

The preceding motivation and setup follows that of Jog and Loh [1], and the same problem was also considered with different motivation and terminology by Huleihel, Polyanskiy, and Shayevitz [2], who framed the problem using the terminology of relay channels. While relay channels are already a central topic in information theory, the authors of [2] further motivated their study via the more recent information velocity problem, which was posed by Yury Polyanskiy [2] and is also captured under a general framework studied by Rajagopalan and Schulman [3]. In that problem, the goal is to reliably transmit a single bit over a long chain of relays while maintaining a non-vanishing ratio between the number of hops and the total transmission time. Below we will discuss a conjecture (for the 2-hop setting) made in [2] that is directly inspired by the connection with information velocity.

It is known that the student’s optimal strategy is the maximum likelihood decoder (assuming the teacher’s strategy is known) [2, 1]. In [1], it was discussed that optimal strategies can be difficult to analyze in general, and accordingly, the following two simpler student strategies were considered:

  • Majority rule: \hat{\Theta}_{n} is chosen according to the majority of the observations Z_{1},\ldots,Z_{n}.

  • \epsilon-majority rule: \hat{\Theta}_{n} is the majority of the last \epsilon\cdot n observations for some \epsilon\in(0,1), with the rough idea being to allow time for the teacher to learn \Theta first.

These student strategies were analyzed together with the following teacher strategies:

  • Simple forwarding: The teacher directly forwards its observation at each time step, i.e., \hat{X}_{i}=Y_{i}.

  • Cumulative teaching: The teacher transmits its best estimate of \Theta at each time step, i.e., \hat{X}_{i} is the majority value among Y_{1},\ldots,Y_{i}.

It was demonstrated in [1] that none of these strategies uniformly outperforms the others, and that any combination of them falls short of the D(1/2\|\max(p,q)) upper bound (i.e., converse) based on data processing arguments (here and subsequently, D(p_{1}\|p_{2}) denotes the relative entropy between Bernoulli distributions with parameters p_{1} and p_{2}). The optimal learning rate was left as an open problem, and it was conjectured that at least part of the weakness is in the upper bound, i.e., that one cannot attain a rate of D(1/2\|\max(p,q)).

In [2], both simple forwarding and cumulative teaching (i.e., relaying) were considered, along with a block-structured variant that only updates the majority every \sqrt{n} time steps. These strategies were analyzed alongside the optimal maximum-likelihood learning (i.e., decoding) rule. While the learning rates attained again fall short of the D(1/2\|\max(p,q)) upper bound, it was shown that one comes within a factor \frac{3}{4} when p=q and p\to\frac{1}{2} (i.e., the high noise setting). It was conjectured in [2] that using a more sophisticated protocol, this constant \frac{3}{4} could be made arbitrarily close to one in this high-noise limit. As hinted above, the motivation for this conjecture was the fact that positive information velocity is known to be attainable; the authors of [2] discuss that for this to be possible, the 1-hop and 2-hop exponents should match in the high-noise limit so that "information propagation does not slow down".

The study of [1] was inspired by earlier works on social learning [4, 5, 6], and most similar to [7]. Since these are less directly relevant to our work, we refer the reader to [1] for a detailed discussion of the differences.

As we will see shortly, our main result disproves the above-mentioned conjecture of [1], and not only confirms the conjecture of [2] for the regime p\to\frac{1}{2}, but also significantly strengthens it by extending it to all p\in(0,\frac{1}{2}).

II Optimal Rate Under Binary Symmetric Noise

In this section, we focus on the above-described setup in which both channels in the system are binary symmetric channels. We turn to more general binary-input discrete memoryless channels in Section III.

II-A Statement of Optimal Learning Rate

Our first main result is stated as follows.

Theorem 1.

Under the preceding setup, for any p,q\in(0,1/2), let \mathcal{R}^{*}(p,q) be the supremum of learning rates across all teaching and learning protocols. Then

\mathcal{R}^{*}(p,q)\geq D\Big(\frac{1}{2}\,\Big\|\max(p,q)\Big), (4)

where D(a\|b)=a\ln\frac{a}{b}+(1-a)\ln\frac{1-a}{1-b} denotes the binary relative entropy function (in base e).

We achieve this result via a novel block-structured teaching strategy. In contrast with [2], the teacher transmits bits based on a single block at a time, ignoring earlier blocks. The teacher encodes the number of 1s received in the block by sending a sorted sequence (i.e., a string of 1s followed by a string of 0s), with the transition point chosen according to a carefully-chosen non-linear function of the number of 1s observed. The student performs maximum-likelihood decoding, and we study the error probability via the Bhattacharyya coefficient. The details are given in Sections II-B and II-C, and in Section II-D we show that the overall computation time required by the teacher and student is linear in n.

Comparison to upper bound (converse). If q=0, then the optimal learning rate is given by the majority decoder, which achieves a learning rate of [1, 2]

D\Big(\frac{1}{2}\,\Big\|\,p\Big)=\frac{1}{2}\ln\left(\frac{1}{4p(1-p)}\right). (5)

Through data processing inequalities, we have \mathcal{R}^{*}(p,q)\leq\mathcal{R}^{*}(p,0)=D\big(\frac{1}{2}\|p\big), and an analogous argument yields \mathcal{R}^{*}(p,q)\leq D\big(\frac{1}{2}\|q\big). We therefore obtain the upper bound

\mathcal{R}^{*}(p,q)\leq D\Big(\frac{1}{2}\,\Big\|\max(p,q)\Big). (6)

Theorem 1 shows that this bound is tight, disproving the above-mentioned conjecture of [1], and both confirming and strengthening the conjecture of [2].

In the remainder of the section, we only consider the case p=q. This is without loss of generality, since in the general case, one can inject additional noise into the system to make both noise parameters equal to \max(p,q).

II-B Teacher and Student Protocol

Before describing the protocol, we first need to introduce some statistical tools.

II-B1 Statistical Tools and Student Decoding Rule

Let A and B be discrete random variables defined on some finite alphabet \Omega. The Bhattacharyya coefficient is a real number in [0,1] given by

\rho(A,B)=\sum_{x\in\Omega}\sqrt{\mathbb{P}(A=x)\mathbb{P}(B=x)}. (7)

Overloading notation, we will sometimes also denote this quantity by \rho(P_{A},P_{B}), where P_{A} and P_{B} are the associated probability mass functions.

The Bhattacharyya coefficient measures the ‘closeness’ between the distributions of A and B. Values close to 0 indicate easily separated distributions, while values close to 1 indicate very similar distributions. In particular, \rho(A,B)=0 if and only if A and B have disjoint supports, and \rho(A,B)=1 if and only if they are identically distributed.
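As a small illustration (a minimal sketch with hypothetical Bernoulli distributions), the quantity in (7) can be computed directly from two probability mass functions over a common finite alphabet:

```python
from math import sqrt

def bhattacharyya(p_a, p_b):
    """Bhattacharyya coefficient of two PMFs given as dicts over a common finite alphabet."""
    support = set(p_a) | set(p_b)
    return sum(sqrt(p_a.get(x, 0.0) * p_b.get(x, 0.0)) for x in support)

# Bernoulli(p) vs. Bernoulli(1-p), a pair that appears in the error analysis below.
p = 0.2
A = {0: 1 - p, 1: p}
B = {0: p, 1: 1 - p}
print(bhattacharyya(A, B))   # 2*sqrt(0.2*0.8) = 0.8 = exp(-D(1/2 || 0.2))
```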

We will make use of some standard properties of the Bhattacharyya coefficient:

  • Suppose that we have a random variable X which is known to follow one of two known distributions, A or B. If we draw one instance of X and use it to decide which distribution X came from, then there exists a strategy achieving an error probability of at most \rho(A,B). To achieve this, we simply use the maximum likelihood test: Given X=x, we decide that A is the true distribution if \mathbb{P}(A=x)>\mathbb{P}(B=x), and decide that B is the true distribution otherwise. The probability of selecting distribution A when B is indeed the true distribution is given by

    \sum_{x\in\Omega}\mathbb{P}(B=x)\mathbbm{1}_{\mathbb{P}(A=x)>\mathbb{P}(B=x)}\leq\sum_{x\in\Omega}\sqrt{\mathbb{P}(A=x)\mathbb{P}(B=x)}. (8)

    The other error type is upper bounded similarly, and combining the two error types yields the desired bound:

    \mathbb{P}({\rm error})\leq\rho(A,B). (9)
  • If A_{1},A_{2} are independent, and similarly for B_{1},B_{2}, then

    \rho((A_{1},A_{2}),(B_{1},B_{2}))=\rho(A_{1},B_{1})\cdot\rho(A_{2},B_{2}). (10)

    This follows by a direct expansion of (7).

  • Let \vec{A}=(A_{1},A_{2},\ldots,A_{m}) be m i.i.d. copies of A and \vec{B}=(B_{1},B_{2},\ldots,B_{m}) be m i.i.d. copies of B. Then

    \rho(\vec{A},\vec{B})=\rho(A,B)^{m}. (11)

    This is a direct consequence of (10).

In accordance with (9), we assume throughout the paper that the student adopts the optimal maximum-likelihood decoding rule, where the two underlying distributions are the conditional distributions of Z_{1},\dotsc,Z_{n} given \Theta=1 and \Theta=0, respectively.

II-B2 Block-Structured Design and Teaching Strategy

The transmission protocol of length n is broken down into n/k blocks, each of length k. For each i=1,2,3,\ldots,\frac{n}{k}-1, we let (\hat{X}_{ik+1},\ldots,\hat{X}_{(i+1)k}) be a (deterministic) function of (Y_{(i-1)k+1},\dotsc,Y_{ik}). The values of \hat{X}_{1},\ldots,\hat{X}_{k} are ignored by the student. The function used to generate \vec{\hat{X}}\in\{0,1\}^{k} from \vec{Y}\in\{0,1\}^{k} is given in Section II-B3.

Under this design, the i-th block of \hat{X} only depends on the (i-1)-th block of Y; see Figure 2 for an illustration. To simplify notation, we define

W_{i}=(Z_{ik+1},\ldots,Z_{(i+1)k}) (12)

to be the i-th block received by the student.

Figure 2: A diagrammatic representation of the block protocol.

The memoryless properties of the channels imply that the W_{i}'s are conditionally independent given \Theta. Since the function that generates \vec{\hat{X}} from \vec{Y} is the same every time, their distributions are also identical; we denote this common distribution by P_{W}.

The student needs to distinguish between n/k-1 i.i.d. copies of P_{W|\Theta=1} and n/k-1 i.i.d. copies of P_{W|\Theta=0}. Using (9) and (11), we have

\mathbb{P}(\hat{\Theta}_{n}\neq\Theta)\leq\rho(P_{W|\Theta=1},P_{W|\Theta=0})^{n/k-1}. (13)

By setting k to be a fixed constant as n increases, it follows that the learning rate is lower bounded by

\mathcal{R}=\limsup_{n\rightarrow\infty}\left\{-\frac{1}{n}\ln\mathbb{P}(\hat{\Theta}_{n}\neq\Theta)\right\} (14)
\geq-\frac{1}{k}\ln\rho(P_{W|\Theta=1},P_{W|\Theta=0}), (15)

where W implicitly depends on k. In fact, while a constant k suffices for our purposes, a similar argument holds when k increases but satisfies k=o(n), in which case the operation \limsup_{k\rightarrow\infty} should also be included on the right-hand side of (15).

Remark 1.

This block structure bears high-level similarity to block-Markov coding in the analysis of relay channels [8, Sec. 16.4]. However, the details here are very different from existing analyses that use joint typicality and related notions. The error exponent associated with such analyses would be low due to the effective block length of k, whereas we maintain a high error exponent via joint decoding over the entire length-n received sequence.

II-B3 Protocol within Each Block

In each block, the teacher receives k noisy bits (e.g., Y_{1},\dotsc,Y_{k} in the first block). Let \alpha be the fraction of 1s among these bits (i.e., the number of 1s divided by k).

The teacher sends k\cdot f(\alpha) bits of 1 followed by k(1-f(\alpha)) bits of 0 (with a delay of one block), where f:[0,1]\rightarrow[0,1] is defined as

f(\alpha)=\begin{cases}0&0\leq\alpha<p\\ \frac{D(\alpha\|p)}{2D(1/2\|p)}&p\leq\alpha\leq 1/2\\ 1-f(1-\alpha)&1/2<\alpha<1-p\\ 1&1-p\leq\alpha\leq 1.\end{cases} (16)

A sample plot is given in Figure 3 with p=0.2. As exemplified in the figure, f is non-decreasing, and

f(\alpha)=1-f(1-\alpha), (17)
f(\alpha)\leq\frac{D(\alpha\|p)}{2D(1/2\|p)}, (18)

where (18) is trivial for \alpha\leq\frac{1}{2}, and follows easily from the convexity of D(\cdot\|p) for \alpha>\frac{1}{2}. Specifically, this convexity implies that the derivative of D(\cdot\|p) increases for \alpha>\frac{1}{2}, whereas the chain rule applied to (17) implies that the derivative of f(\alpha) decreases for \alpha>\frac{1}{2}. Since (18) holds with equality for \alpha\in\big[p,\frac{1}{2}\big] by definition, this behavior of the derivatives implies that (18) also holds for \alpha>\frac{1}{2}.

Figure 3: The function f used in the protocol, plotted for p=0.2. The dashed line represents f(\alpha), while the solid line represents \frac{D(\alpha\|p)}{2D(1/2\|p)}.

In general, k\cdot f(\alpha) may not be an integer. When this happens, we can use \lfloor k\cdot f(\alpha)\rfloor bits of 1 followed by k-\lfloor k\cdot f(\alpha)\rfloor bits of 0. The rounding does not affect the error analysis, so to simplify notation, we focus on the integer-valued case. See the remark following (20) for an additional discussion of this issue.

Before proceeding with the student strategy and error analysis, we intuitively explain why sending a sorted sequence (i.e., with all the 1s sent before the 0s) can make decoding easier for the student. Suppose the student receives the string 111100000010. If the student knows in advance that the string sent by the teacher is sorted, they can be confident that the out-of-place 1 near the end of the string was flipped. However, if this were an unsorted string, such corrections could not be made.

The reasoning behind our choice of f will become apparent in the error analysis in the following subsection, but intuitively, it serves as a carefully-chosen middle ground between the simpler choices f(\alpha)=\alpha (bearing some resemblance to simple forwarding) and f(\alpha)=\boldsymbol{1}\{\alpha>\frac{1}{2}\} (bearing some resemblance to cumulative teaching).
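A minimal sketch of the teacher's per-block encoding rule based on (16) is given below; the helper names kl_bin, f, and encode_block are hypothetical, and rounding is handled by flooring as discussed above.

```python
from math import floor, log

def kl_bin(a, b):
    """Binary relative entropy D(a || b) in nats, with the convention 0*log(0) = 0."""
    terms = 0.0
    if a > 0:
        terms += a * log(a / b)
    if a < 1:
        terms += (1 - a) * log((1 - a) / (1 - b))
    return terms

def f(alpha, p):
    """The transition-point function f from (16)."""
    if alpha < p:
        return 0.0
    if alpha <= 0.5:
        return kl_bin(alpha, p) / (2 * kl_bin(0.5, p))
    if alpha < 1 - p:
        return 1.0 - f(1.0 - alpha, p)
    return 1.0

def encode_block(received_bits, p):
    """Teacher's transmission for one block: floor(k*f(alpha)) ones followed by zeros."""
    k = len(received_bits)
    alpha = sum(received_bits) / k
    num_ones = floor(k * f(alpha, p))
    return [1] * num_ones + [0] * (k - num_ones)

# Hypothetical example: k = 10, p = 0.2, seven 1s observed by the teacher.
print(encode_block([1, 1, 1, 0, 1, 1, 0, 1, 1, 0], 0.2))
```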

II-C Error Analysis

Continuing the study of a single block W defined according to (12), observe that we have the Markov chain

\Theta\rightarrow\alpha\rightarrow W, (19)

and in accordance with (15), we would like the distributions P_{W|\Theta=1} and P_{W|\Theta=0} to be as ‘far apart’ as possible. We break the error analysis into two parts, analyzing the relations \Theta\rightarrow\alpha and \alpha\rightarrow W separately.

II-C1 Relating Θ\Theta and α\alpha

Define W_{\alpha} to follow the conditional distribution of W given \alpha. The distribution of W_{\alpha} is simple; recalling that we are focusing on the case p=q, we have the following:

  • The bits of W_{\alpha} are independently distributed;

  • If i\leq k\cdot f(\alpha), then the i-th bit is 1 with probability 1-p and 0 otherwise;

  • If i>k\cdot f(\alpha), then the i-th bit is 1 with probability p and 0 otherwise.

We proceed by upper bounding \rho(W_{\alpha_{1}},W_{\alpha_{2}}), starting with the case that \alpha_{1}\leq\alpha_{2}.

The bits up to index k\cdot f(\alpha_{1}), and the bits after index k\cdot f(\alpha_{2}), are all identically distributed between W_{\alpha_{1}} and W_{\alpha_{2}}, and thus do not provide any distinguishing power. We are left with k\cdot f(\alpha_{2})-k\cdot f(\alpha_{1}) bits, which are i.i.d. according to either Bernoulli(p) or Bernoulli(1-p).

The Bhattacharyya coefficient between Bernoulli(p) and Bernoulli(1-p) can easily be computed to be 2\sqrt{p(1-p)}=e^{-D(1/2\|p)}, and using (11), we obtain

\rho(W_{\alpha_{1}},W_{\alpha_{2}})=e^{-kD(1/2\|p)\cdot(f(\alpha_{2})-f(\alpha_{1}))}. (20)

(If we were to explicitly incorporate the rounding discussed following (18), then (20) would be replaced by the slightly looser bound \rho(W_{\alpha_{1}},W_{\alpha_{2}})\leq e^{-D(1/2\|p)\cdot(kf(\alpha_{2})-kf(\alpha_{1})-1)}; the subtraction of one in the exponent only amounts to multiplying the entire expression by a constant, which does not impact the resulting exponential decay rate.)

To simplify the exponent, we use (17) and (18) to obtain

f(\alpha_{1})\leq\frac{D(\alpha_{1}\|p)}{2D(1/2\|p)}, (21)
f(\alpha_{2})=1-f(1-\alpha_{2})\geq 1-\frac{D(1-\alpha_{2}\|p)}{2D(1/2\|p)}, (22)

and hence,

f(\alpha_{2})-f(\alpha_{1})\geq 1-\frac{D(\alpha_{1}\|p)}{2D(1/2\|p)}-\frac{D(1-\alpha_{2}\|p)}{2D(1/2\|p)}. (23)

We can therefore weaken (20) to

\rho(W_{\alpha_{1}},W_{\alpha_{2}})\leq e^{-k\cdot(D(1/2\|p)-D(\alpha_{1}\|p)/2-D(1-\alpha_{2}\|p)/2)}. (24)

Although the assumption \alpha_{1}\leq\alpha_{2} was used in the derivation of (20), a trivial argument shows that when \alpha_{1}>\alpha_{2}, the right-hand side of (20) is at least one (since f is non-decreasing). Since the Bhattacharyya coefficient never exceeds one, it follows that (24) remains true in the case that \alpha_{1}>\alpha_{2}.

II-C2 Relating α\alpha and WW

The next step is to decompose the distribution of WW into the simpler distributions WαW_{\alpha}. To do so, we use the following definition.

Definition 1.

Let A_{1},\ldots,A_{m} be probability distributions defined on a common finite probability space \Omega, and let p_{1},\ldots,p_{m} be real numbers in [0,1] that sum to one. The mixture distribution A described by [p_{1},A_{1};\ldots;p_{m},A_{m}] is defined by \mathbb{P}(A=x)=\sum_{i=1}^{m}p_{i}\cdot\mathbb{P}(A_{i}=x).

We proceed with a simple technical lemma on the Bhattacharyya coefficient of mixtures.

Lemma 1.

Suppose that A follows a mixture distribution described by [p_{1},A_{1};\ldots;p_{m},A_{m}]. Then for any distribution B, we have

\rho(A,B)\leq\sum_{i=1}^{m}\sqrt{p_{i}}\,\rho(A_{i},B). (25)
Proof.

Using the inequality \sqrt{a+b}\leq\sqrt{a}+\sqrt{b}, we have

\rho(A,B)=\sum_{x\in\Omega}\sqrt{\mathbb{P}(A=x)\mathbb{P}(B=x)} (26)
=\sum_{x\in\Omega}\sqrt{\sum_{i=1}^{m}p_{i}\mathbb{P}(A_{i}=x)\mathbb{P}(B=x)} (27)
\leq\sum_{x\in\Omega}\sum_{i=1}^{m}\sqrt{p_{i}\mathbb{P}(A_{i}=x)\mathbb{P}(B=x)}. (28)

Simple re-arranging gives the desired result. ∎

Using the symmetry of the Bhattacharyya coefficient and applying Lemma 1 twice, we obtain

\rho(P_{W|\Theta=1},P_{W|\Theta=0})\leq\sum_{i=0}^{k}\sum_{j=0}^{k}\sqrt{\mathbb{P}(\alpha=i/k|\Theta=1)}\sqrt{\mathbb{P}(\alpha=j/k|\Theta=0)}\,\rho(W_{i/k},W_{j/k}). (29)

Moreover, the conditional distribution of \alpha given \Theta is simple:

(\alpha|\Theta=1)\sim\frac{1}{k}\cdot\text{Binomial}(k,1-p), (30)
(\alpha|\Theta=0)\sim\frac{1}{k}\cdot\text{Binomial}(k,p). (31)

Hence, applying the Chernoff bound for the binomial distribution (e.g., see [9, Sec. 2.2]), we obtain

\mathbb{P}(\alpha=\alpha_{0}|\Theta=1)\leq e^{-kD(\alpha_{0}\|1-p)}=e^{-kD(1-\alpha_{0}\|p)}, (32)

and similarly,

\mathbb{P}(\alpha=\alpha_{0}|\Theta=0)\leq e^{-kD(\alpha_{0}\|p)}. (33)

Combining these bounds with (29) and with (24) (applied with \alpha_{1}=j/k and \alpha_{2}=i/k, which is valid since \rho is symmetric and (24) holds for any ordering), we find that the D(j/k\|p)/2 and D(1-i/k\|p)/2 terms cancel in the exponent, and we are left with

\rho(P_{W|\Theta=1},P_{W|\Theta=0})\leq\sum_{i=0}^{k}\sum_{j=0}^{k}e^{-kD(1/2\|p)} (34)
=(k+1)^{2}e^{-kD(1/2\|p)}. (35)

Combining (15) with (35) gives

\mathcal{R}\geq D(1/2\|p)-\mathcal{O}\left(\frac{\log k}{k}\right), (36)

and hence, by setting k large enough, we can obtain any learning rate arbitrarily close to D(1/2\|p). This completes the proof of Theorem 1.
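As a numerical sanity check (not needed for the proof), one can compute the single-block Bhattacharyya coefficient \rho(P_{W|\Theta=1},P_{W|\Theta=0}) exactly for a small block size by enumerating all received blocks, and compare the resulting per-block exponent -\frac{1}{k}\ln\rho from (15) against the target D(1/2\|p); the following minimal sketch reuses the hypothetical helpers kl_bin and f from the earlier snippet.

```python
from itertools import product
from math import comb, floor, log, sqrt

def block_output_pmf(theta, k, p):
    """Exact PMF of the student's received block W given Theta, for the protocol
    of Section II-B with p = q (reuses the hypothetical helpers f and kl_bin above)."""
    pmf = {}
    for m in range(k + 1):                        # number of 1s the teacher observes
        bit_prob = (1 - p) if theta == 1 else p   # P(Y_i = 1 | Theta)
        p_alpha = comb(k, m) * bit_prob ** m * (1 - bit_prob) ** (k - m)
        ones = floor(k * f(m / k, p))             # teacher sends `ones` 1s, then 0s
        sent = [1] * ones + [0] * (k - ones)
        for w in product((0, 1), repeat=k):       # received block through BSC(p)
            p_w = 1.0
            for wi, xi in zip(w, sent):
                p_w *= (1 - p) if wi == xi else p
            pmf[w] = pmf.get(w, 0.0) + p_alpha * p_w
    return pmf

k, p = 8, 0.2
pmf1, pmf0 = block_output_pmf(1, k, p), block_output_pmf(0, k, p)
rho = sum(sqrt(pmf1[w] * pmf0[w]) for w in pmf1)
print(-log(rho) / k, kl_bin(0.5, p))   # per-block exponent vs. the target D(1/2 || p)
```

As k grows, the printed per-block exponent approaches D(1/2\|p), in line with (36).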

II-D Computational complexity

In this section, we argue that it is possible to execute the described protocol with an overall computation time of \mathcal{O}(n) (i.e., no higher than that of simply reading the received bits). This is immediate for the teacher, so we focus our attention on the student.

To implement the maximum likelihood decoder, it suffices to compute the associated log-likelihood ratio (LLR) \log\frac{\mathbb{P}(\vec{Z}=\vec{z}|\Theta=1)}{\mathbb{P}(\vec{Z}=\vec{z}|\Theta=0)} given the n received bits \vec{z}. In addition, since the n/k-1 blocks are independent by construction and memorylessness, the overall LLR is the sum of per-block LLRs. Hence, we only need to show that each of these can be computed in time \mathcal{O}(k).

The only minor difficulty in computing the LLR for a single block is that we need to account for all k+1 possible choices of the latent/hidden variable \alpha\in\{0,\frac{1}{k},\dotsc,1\}: Letting W denote a generic length-k block according to (12), we have

\mathbb{P}(W|\Theta=1)=\sum_{\alpha}\mathbb{P}(\alpha|\Theta=1)\mathbb{P}(W|\alpha), (37)

and similarly for \mathbb{P}(W|\Theta=0). In addition, since the channel is memoryless, we have

\mathbb{P}(W|\alpha)=\prod_{i=1}^{k}\mathbb{P}(W(i)|\alpha), (38)

where W(i) is the i-th bit comprising W. Naively, we could evaluate (37) and (38) directly, but this would give a per-block complexity of \mathcal{O}(k^{2}).

To improve this, we recall that for two values \alpha_{1},\alpha_{2}\in\{0,\frac{1}{k},\dotsc,1\}, the associated values of \mathbb{P}(W(i)|\alpha) only differ for values of i in between k\cdot f(\alpha_{1}) and k\cdot f(\alpha_{2}). Therefore, we can initially compute \mathbb{P}(W|\alpha) for \alpha=0, and then use it to compute the value for \alpha=\frac{1}{k} by only looking at the first k\cdot f\big(\frac{1}{k}\big) bits of W, and so on for \alpha\in\{\frac{2}{k},\frac{3}{k},\dotsc,1\}; at each step, we only need to look at the bits affected by incrementing \alpha. In this manner, it only takes \mathcal{O}(k) time to compute all k+1 values of \mathbb{P}(W|\alpha). In addition, \mathbb{P}(\alpha|\Theta=1) in (37) follows a scaled binomial distribution, whose probability mass function can be pre-computed in \mathcal{O}(k) time (assuming constant-time arithmetic operations). Hence, we obtain an overall per-block computation time of \mathcal{O}(k).
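A minimal sketch of this incremental computation for the p=q case is given below (reusing the hypothetical helper f from the earlier snippet); it computes all k+1 values of \mathbb{P}(W|\alpha) in a single pass and then mixes them over the binomial prior as in (37).

```python
from math import floor, log

def block_llr(w, p):
    """Per-block LLR ln[ P(W = w | Theta = 1) / P(W = w | Theta = 0) ], computed in
    O(k) arithmetic operations (a sketch of Section II-D, reusing the helper f above)."""
    k = len(w)

    # P(W = w | alpha = 0): the teacher sent all 0s, so each received 1 is a flip.
    cur = 1.0
    for wi in w:
        cur *= p if wi == 1 else (1 - p)
    prob_w_given_alpha = [cur]

    # Incrementing alpha from (m-1)/k to m/k only changes the bits whose index lies
    # between floor(k*f((m-1)/k)) and floor(k*f(m/k)); update only those factors.
    boundary = 0
    for m in range(1, k + 1):
        new_boundary = floor(k * f(m / k, p))
        for i in range(boundary, new_boundary):   # these positions now target a 1
            cur /= p if w[i] == 1 else (1 - p)
            cur *= (1 - p) if w[i] == 1 else p
        boundary = new_boundary
        prob_w_given_alpha.append(cur)

    # Binomial priors P(alpha = m/k | Theta), built up by a running product in O(k).
    pmf1, pmf0 = [p ** k], [(1 - p) ** k]
    for m in range(1, k + 1):
        ratio = (k - m + 1) / m
        pmf1.append(pmf1[-1] * ratio * (1 - p) / p)
        pmf0.append(pmf0[-1] * ratio * p / (1 - p))

    # Mix over alpha as in (37) and return the log-likelihood ratio.
    like1 = sum(a * b for a, b in zip(pmf1, prob_w_given_alpha))
    like0 = sum(a * b for a, b in zip(pmf0, prob_w_given_alpha))
    return log(like1 / like0)

# The student sums block_llr over the n/k - 1 received blocks and outputs
# Theta_hat = 1 if the total is positive (and 0 otherwise).
```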

In addition to the low computational overhead, another advantage of our strategy is that it is anytime. Specifically, it can be run without knowledge of n, and we can stop the algorithm at any time and ask the student for the estimate \hat{\Theta}_{n}. Letting \epsilon_{k,n} be the resulting error probability with block size k and total time n, it still holds in this scenario that -\frac{1}{n}\ln\epsilon_{k,n}\rightarrow D(1/2\|p), as long as k\rightarrow\infty and n/k\rightarrow\infty.

III General Binary-Input Discrete Memoryless Channels

In this section, we consider the same model as Section I-A, except that we replace BSC(p) and BSC(q) by general discrete memoryless channels (DMCs) P and Q (known to the student and teacher), both of which have binary inputs \{0,1\}. We denote the transition probabilities by P(y|\theta) and Q(z|\hat{x}). The case that Q has non-binary input is discussed in Section IV.

We first overview the 1-hop case in Section III-A, following the classical work of Shannon, Gallager, and Berlekamp [10]. We then state an achievable learning rate in Section III-B, and give the protocol and analysis in Section III-C, which are essentially more technical variants of the binary symmetric case. We provide an upper bound on the learning rate (i.e., a converse) in Section III-D, and use it to deduce the optimal learning rate in certain special cases. While the tightness of our achievable learning rate remains open in general, we additionally show in Section III-E that, at least under certain technical assumptions, it cannot be improved for block-structured protocols.

III-A Existing Results on the 1-Hop Case

We first consider a 1-hop scenario in which a single agent makes repeated observations of the unknown variable Θ\Theta through repeated uses of a binary-input DMC PP. This is a well-studied problem, and we will summarize some of the results from [10].

We begin with a generalization of the Bhattacharyya coefficient. For a real number s\in(0,1) and two random variables A,B defined on the same finite alphabet \Omega, let

\rho(A,B,s)=\sum_{x\in\Omega}\mathbb{P}(A=x)^{1-s}\mathbb{P}(B=x)^{s}\in[0,1], (39)

where the upper bound of one follows by applying Jensen's inequality to \mathbb{E}_{A}\big[\big(\frac{P_{B}(B)}{P_{A}(A)}\big)^{s}\big]. This function can be extended to s=0 and s=1 by continuity; for instance, \rho(A,B,0)=\sum_{x\in\Omega}\mathbb{P}(A=x)\mathbbm{1}_{\mathbb{P}(B=x)>0}. We also note that setting s=\frac{1}{2} recovers the Bhattacharyya coefficient. We write \rho(A,B,s) and \rho(P_{A},P_{B},s) interchangeably. Similarly to (10)–(11), we have the following properties for independent pairs and length-m i.i.d. vectors:

\rho((A_{1},A_{2}),(B_{1},B_{2}),s)=\rho(A_{1},B_{1},s)\cdot\rho(A_{2},B_{2},s), (40)
\rho(\vec{A},\vec{B},s)=\rho(A,B,s)^{m}. (41)

Given the DMC P with inputs 0 and 1, we define

\rho_{P}(s)=\rho(P(\cdot|0),P(\cdot|1),s), (42)

and denote its logarithm by

\mu_{P}(s)=\ln\rho_{P}(s)\leq 0, (43)

where P(\cdot|0) and P(\cdot|1) are treated as probability distributions on the output alphabet. (See Figure 4 below for some examples of how \mu_{P}(s) varies as a function of s.) The first and second derivatives of \mu_{P}(s) for s\in(0,1) are denoted by \mu^{\prime}_{P}(s) and \mu^{\prime\prime}_{P}(s), and once again, the values at the endpoints s\in\{0,1\} are defined with respect to the appropriate limits.

Suppose that the agent makes a single observation of \Theta\in\{0,1\} through P, and uses it to estimate \Theta. Using the identity \mathbbm{1}_{\mathbb{P}(B=x)>\mathbb{P}(A=x)}\leq\big(\frac{\mathbb{P}(B=x)}{\mathbb{P}(A=x)}\big)^{s}, we find that for any s\geq 0, \rho_{P}(s) is an upper bound to the probability of a maximum-likelihood decision rule selecting B when A is true. An analogous property holds for the reverse error event with 1-s in place of s, and it follows that \rho_{P}(s) is an upper bound to the overall error probability for all s\in[0,1].

In view of the property in (41), if the agent instead makes n independent observations, the upper bound becomes \rho_{P^{n}}(s)=\rho_{P}(s)^{n}, implying a learning rate of at least

-\frac{1}{n}\ln\rho_{P}(s)^{n}=-\mu_{P}(s). (44)

The following well-known result reveals that this learning rate is optimal upon optimizing over s\in[0,1].

Lemma 2.

[10, Corollary of Thm. 5] Let s^{*}\in[0,1] be chosen to minimize \mu_{P}(s). Then any decoding strategy of the agent after n uses of the channel must have an error probability of at least \frac{1}{8}\exp\big(n\mu_{P}(s^{*})-\sqrt{2n\mu_{P}^{\prime\prime}(s^{*})}\big) when \Theta is equiprobable on \{0,1\}. (This is a weakened version of the statement in [10], in which we apply \max\{s,1-s\}\leq 1 in the coefficient of the \sqrt{n} term; in addition, we specialize to the case that every transmitted symbol is identical, in order to match our setup.)

Since the \sqrt{2n\mu_{P}^{\prime\prime}(s^{*})} term is sub-linear in n, this immediately gives the following corollary.

Corollary 1.

The optimal learning rate for the 1-hop system is given by -\min_{s\in[0,1]}\mu_{P}(s).
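To illustrate Corollary 1 numerically, the following minimal sketch (with a hypothetical dictionary representation of a binary-input DMC) evaluates \mu_{P}(s) on a grid and minimizes it; for a BSC(p), the minimizer is s=\frac{1}{2} and the resulting exponent recovers D(1/2\|p) from (5).

```python
from math import log

def mu(channel, s):
    """mu(s) = ln sum_y P(y|0)^(1-s) * P(y|1)^s for a binary-input DMC represented
    as {input: {output: probability}} (a hypothetical representation)."""
    outputs = set(channel[0]) | set(channel[1])
    return log(sum(channel[0].get(y, 0.0) ** (1 - s) * channel[1].get(y, 0.0) ** s
                   for y in outputs))

def one_hop_exponent(channel, grid=10_000):
    """-min_{s in [0,1]} mu(s) via a grid search; the endpoints s = 0, 1 are
    approached through the limit rather than evaluated directly."""
    return -min(mu(channel, i / grid) for i in range(1, grid))

p = 0.2
bsc = {0: {0: 1 - p, 1: p}, 1: {0: p, 1: 1 - p}}
print(one_hop_exponent(bsc))             # approximately D(1/2 || p)
print(0.5 * log(1 / (4 * p * (1 - p))))  # D(1/2 || p) via (5)
```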

III-B Achievability Result

We now state the second main result of our paper, giving an achievability result for general binary-input DMCs.

Theorem 2.

Let P and Q be arbitrary binary-input discrete memoryless channels, and let \mathcal{R}^{*}(P,Q) be the supremum of learning rates across all teaching and learning protocols. Then

\mathcal{R}^{*}(P,Q)\geq-\min_{s\in[0,1]}\max(\mu_{P}(s),\mu_{Q}(s)). (45)

Moreover, this can be improved to

\mathcal{R}^{*}(P,Q)\geq-\min_{s\in[0,1]}\max\big(\mu_{P}(s),\min(\mu_{Q}(s),\mu_{Q}(1-s))\big). (46)

While (46) clearly implies (45), we find the expression (45) easier to work with and prove. Once (45) is proved, (46) follows from the fact that one can choose to flip the inputs of Q, which amounts to replacing \mu_{Q}(s) by \mu_{Q}(1-s). (At first this leads to the seemingly different learning rate of -\min\big(\min_{s\in[0,1]}\max(\mu_{P}(s),\mu_{Q}(s)),\min_{s\in[0,1]}\max(\mu_{P}(s),\mu_{Q}(1-s))\big), but then (46) can be recovered by taking the two \min_{s\in[0,1]} operations outside the outer minimum (swapping the minimization order has no effect), and applying the identity \max(a,\min(b,c))=\min(\max(a,b),\max(a,c)), which is verified by checking all 6 orderings of a, b, and c.)

When the channels P and Q are identical, (45) reduces to the expression in Corollary 1. Since the learning rate cannot exceed that of the 1-hop case (due to a data-processing argument similar to the binary symmetric setting), we conclude that Theorem 2 provides the optimal learning rate in this case. Other scenarios in which Theorem 2 gives the optimal learning rate will be discussed in Section III-D.

III-C Protocol and Analysis for the Achievability Result

Fix an arbitrary value of \bar{s}\in[0,1] and define

\mu_{\max}=\max(\mu_{P}(\bar{s}),\mu_{Q}(\bar{s})). (47)

To prove (45), it suffices to show that -\mu_{\max} is an achievable learning rate. We assume without loss of generality that \mu_{\max}<0, since a learning rate of zero is trivial.

We again adopt the block structure used in Section II, with k denoting the block size. Suppose that in a single block, the teacher receives \vec{Y}=(Y_{1},Y_{2},\ldots,Y_{k}). We define the log-likelihood ratio (LLR) as

l(\vec{Y})=\sum_{i=1}^{k}\ln\frac{P(Y_{i}|1)}{P(Y_{i}|0)}, (48)

and let L_{1} (respectively, L_{0}) be the distribution of l(\vec{Y}) under \Theta=1 (respectively, \Theta=0).

Upon receiving \vec{Y}, the teacher computes l=l(\vec{Y}), and sends g(l) bits of 1 followed by k-g(l) bits of 0 (mirroring the sorted transmissions of Section II), where g(l) is defined as follows:

g(l)=\begin{cases}\min\big(k,\frac{1-\bar{s}}{\mu_{\max}}\ln\mathbb{P}(L_{0}\geq l)\big)&l\leq 0\\ \max\big(0,k-\frac{\bar{s}}{\mu_{\max}}\ln\mathbb{P}(L_{1}\leq l)\big)&l>0.\end{cases} (49)

Since 1-\bar{s}\geq 0, \mu_{\max}<0, and \ln\mathbb{P}(L_{0}\geq l)\leq 0, we have

\frac{1-\bar{s}}{\mu_{\max}}\ln\mathbb{P}(L_{0}\geq l)\geq 0, (50)

and similarly,

\frac{\bar{s}}{\mu_{\max}}\ln\mathbb{P}(L_{1}\leq l)\geq 0, (51)

which implies that 0\leq g(l)\leq k for all l. It may be necessary to round g(l) to the nearest integer, but similarly to Section II, this does not affect the result, and it is omitted from our analysis.

The main difference in the analysis compared to Section II is establishing the following technical lemma.

Lemma 3.

Under the preceding setup and definitions, we have for all ll that

1s¯μmaxln(L0l)ks¯μmaxln(L1l).\frac{1-\bar{s}}{\mu_{\max}}\ln\mathbb{P}(L_{0}\geq l)\geq k-\frac{\bar{s}}{\mu_{\max}}\ln\mathbb{P}(L_{1}\leq l). (52)
Proof.

The proof is based on probabilistic bounds on log-likelihood ratios given in [10], along with algebraic manipulations that are elementary but rather technical; the details are given in Appendix -A. ∎

We claim that this result implies

\frac{1-\bar{s}}{\mu_{\max}}\ln\mathbb{P}(L_{0}\geq l)\geq g(l)\geq k-\frac{\bar{s}}{\mu_{\max}}\ln\mathbb{P}(L_{1}\leq l). (53)

To see this, we check the possible cases in the definition of g(l) in (49). If l\leq 0, then the left inequality in (53) is trivial, and the right inequality follows by considering two sub-cases: If the \min(k,\cdot) in (49) is attained by k, then we apply (51), and otherwise, we apply (52). The case l>0 is handled similarly.

From here, the analysis is similar to that of Section II. Observe that we have the Markov chain \Theta\rightarrow l(\vec{Y})\rightarrow W, where W denotes the k symbols received by the student in a single block. Since W takes on two different distributions depending on whether \Theta=1 or \Theta=0, we may also view W as the output of a ‘single-use’ discrete memoryless channel with input \Theta and an output of length k. With a slight abuse of notation, we define the quantities \mu_{W}(s) and \rho_{W}(s) according to this channel.

To upper bound \rho_{W}(s), we use the following straightforward generalization of Lemma 1.

Lemma 4.

Suppose that A follows a mixture distribution described by [p_{1},A_{1};\ldots;p_{m},A_{m}]. Then for any distribution B, we have

\rho(A,B,s)\leq\sum_{i=1}^{m}p_{i}^{1-s}\rho(A_{i},B,s), (54)

and

\rho(B,A,s)\leq\sum_{i=1}^{m}p_{i}^{s}\rho(B,A_{i},s). (55)

The proof of this lemma is identical to that of Lemma 1, except that \sqrt{a+b}\leq\sqrt{a}+\sqrt{b} is replaced by the more general (a+b)^{s}\leq a^{s}+b^{s} for s\in[0,1] (and similarly for 1-s).

Decomposing each distribution P_{W|\Theta=1} and P_{W|\Theta=0} as a mixture based on the (finitely many) possible values of l, and applying Lemma 4 twice, we obtain

\rho_{W}(\bar{s})\leq\sum_{l_{0},l_{1}}\mathbb{P}(L_{0}=l_{0})^{1-\bar{s}}\mathbb{P}(L_{1}=l_{1})^{\bar{s}}\rho(P_{W|l_{0}},P_{W|l_{1}},\bar{s}) (56)
\leq\sum_{l_{0},l_{1}}\mathbb{P}(L_{0}\geq l_{0})^{1-\bar{s}}\mathbb{P}(L_{1}\leq l_{1})^{\bar{s}}\rho(P_{W|l_{0}},P_{W|l_{1}},\bar{s}). (57)

To help simplify this expression, we use the following consequence of (53) (again recalling \mu_{\max}<0):

\mathbb{P}(L_{0}\geq l_{0})^{1-\bar{s}}\mathbb{P}(L_{1}\leq l_{1})^{\bar{s}}\leq\exp(g(l_{0})\cdot\mu_{\max})\exp((k-g(l_{1}))\cdot\mu_{\max}). (58)

Suppose first that l_{0}\leq l_{1}. Since g is an increasing function and only the bits in between positions g(l_{0}) and g(l_{1}) differ, we have

\rho(P_{W|l_{0}},P_{W|l_{1}},\bar{s})=\exp(\mu_{Q}(\bar{s})\cdot(g(l_{1})-g(l_{0}))) (59)
\leq\exp(\mu_{\max}\cdot(g(l_{1})-g(l_{0}))), (60)

noting that each differing bit contributes e^{\mu_{Q}(\bar{s})}\leq e^{\mu_{\max}} to the total. Combining (58) and (60), the terms containing g(\cdot) cancel, and we are left with

\sum_{l_{0}\leq l_{1}}\mathbb{P}(L_{0}\geq l_{0})^{1-\bar{s}}\mathbb{P}(L_{1}\leq l_{1})^{\bar{s}}\rho(P_{W|l_{0}},P_{W|l_{1}},\bar{s})\leq\sum_{l_{0}\leq l_{1}}\exp(k\cdot\mu_{\max}). (61)

On the other hand, if l_{0}>l_{1}, then (58) and the increasing property of g give

\mathbb{P}(L_{0}\geq l_{0})^{1-\bar{s}}\mathbb{P}(L_{1}\leq l_{1})^{\bar{s}}\leq\exp(g(l_{0})\cdot\mu_{\max})\exp((k-g(l_{1}))\cdot\mu_{\max}) (62)
\leq\exp(k\cdot\mu_{\max}), (63)

and combining this with the fact that \rho\leq 1, we obtain

\sum_{l_{0}>l_{1}}\mathbb{P}(L_{0}\geq l_{0})^{1-\bar{s}}\mathbb{P}(L_{1}\leq l_{1})^{\bar{s}}\rho(P_{W|l_{0}},P_{W|l_{1}},\bar{s})\leq\sum_{l_{0}>l_{1}}\exp(k\cdot\mu_{\max}). (64)

By the definition of l(\vec{Y}) in (48), the value of l only depends on the number of occurrences of each output symbol in the block of length k, and not on their specific order. Hence, with a finite output alphabet \mathcal{Y}, the number of possible l values is upper bounded by (k+1)^{|\mathcal{Y}|} [11, Ch. 2], and we conclude from (57), (61) and (64) that

\mu_{W}(\bar{s})=\ln\rho_{W}(\bar{s})\leq k\cdot\mu_{\max}+\mathcal{O}(\log k). (65)

As with (15), this implies that the overall learning rate \mathcal{R} satisfies

\mathcal{R}\geq-\frac{1}{k}\mu_{W}(\bar{s})\geq-\mu_{\max}-\mathcal{O}\left(\frac{\log k}{k}\right). (66)

By setting k large enough, we can obtain any rate arbitrarily close to -\mu_{\max}, completing the proof of Theorem 2.

III-D A Converse Result and Its Consequences

As hinted following Theorem 2, by data processing arguments, we cannot obtain a learning rate greater than

-\max\Big(\min_{s\in[0,1]}\mu_{P}(s),\min_{s\in[0,1]}\mu_{Q}(s)\Big), (67)

and this matches Theorem 2 when P=Q. More generally, we can immediately deduce that Theorem 2 gives the optimal learning rate whenever Q is a degraded version of P (i.e., there exists an auxiliary channel U such that composing P with U gives an overall channel with the same transition law as Q), or vice versa. This is because the teacher (respectively, student) can artificially introduce extra noise to reduce to the case that P=Q.

In this section, we give an additional upper bound on the learning rate (i.e., converse) that improves on the simple one in (67), and establishes the tightness of Theorem 2 in certain cases beyond those mentioned above. To state this converse, we introduce an additional definition. Momentarily departing from the teaching/learning setup, consider the following two-round 1-hop communication scenario defined in terms of the DMCs P,Q and a parameter \gamma\in[0,1]:

  • The sender seeks to communicate one of two messages, \Theta\in\{0,1\}, each occurring with probability \frac{1}{2};

  • In the first round of communication, the sender transmits \gamma n symbols via independent uses of P without feedback;

  • After this first round, the sender observes the corresponding \gamma n output symbols via a noiseless feedback link;

  • In the second round of communication, the sender transmits (1-\gamma)n symbols via independent uses of Q, with no further feedback.

This is a somewhat unconventional setup with independent and non-identical channel uses, as well as partial feedback. We let E2(P,Q,γ)E_{2}(P,Q,\gamma) denote the best possible error exponent under this setup as nn\to\infty, with the subscript 2 representing its two-round nature.

Theorem 3.

Let P and Q be arbitrary binary-input discrete memoryless channels, and let \mathcal{R}^{*}(P,Q) be the supremum of learning rates across all teaching and learning protocols. Then, for any \gamma\in[0,1], we have

\mathcal{R}^{*}(P,Q)\leq E_{2}(P,Q,\gamma). (68)
Proof.

Consider a “genie-aided” setup in which the teacher and student are given the following additional information:

  • (i)

    The student can observe the input \hat{X} (rather than only the output Z) of the teacher-student channel Q from time 1 up to \gamma n.

  • (ii)

    The teacher is given the true value of \Theta after time \gamma n.

Since the teacher and student can always choose to ignore this additional information, any upper bound on the learning rate in this genie-aided setting is also an upper bound in the original setting.

The theorem follows by noting that this genie-aided setup can be viewed as an instance of the above two-round communication setup:

  • For the first γn\gamma n symbols, once the student is given X^\hat{X}, no further information is provided by ZZ (a “degraded” version of X^\hat{X}), and the student’s first γn\gamma n observed symbols correspond to repeatedly observing Θ\Theta through the channel PP without feedback.

  • Similarly, for the last (1γ)n(1-\gamma)n symbols, once the teacher is given Θ\Theta, no further information is provided by YY, but the value of Θ\Theta still needs to be conveyed to the student via the uses of QQ. Hence, the student’s final (1γ)n(1-\gamma)n observed symbols correspond to using QQ with no further feedback, but with the first γn\gamma n outputs being available at the sender (teacher).

In the remainder of this subsection, we connect the converse result of Theorem 3 with the achievability result of Theorem 2 in two different ways:

  • Under a certain technical assumption (Assumption 1 below), we show that there exists \gamma\in[0,1] such that the exponent in Theorem 2 equals E_{1}(P,Q,\gamma), which is defined in the same way as E_{2}(P,Q,\gamma) but with no feedback being available in the communication setup. Hence, the achievability and converse only differ with respect to the availability of feedback.

  • In the case that Q is a BSC, we show that Theorems 2 and 3 coincide for all P, thus establishing the optimal learning rate.

Starting with the former, we consider the following technical assumption.

Figure 4: Example comparisons of \mu_{P}(s) vs. \mu_{P}(1-s) for binary-input binary-output channels. We fix P(0|1)=0.2, and vary P(1|0). We observe that for every curve shown, it either holds that \mu_{P}(s)\leq\mu_{P}(1-s) for all s\in[0,\frac{1}{2}], or \mu_{P}(s)\geq\mu_{P}(1-s) for all s\in[0,\frac{1}{2}].
Assumption 1.

The DMCs PP and QQ satisfy the property that for all s[0,12]s\in\big{[}0,\frac{1}{2}\big{]}, it holds that μP(s)μP(1s)\mu_{P}(s)\leq\mu_{P}(1-s) and μQ(s)μQ(1s)\mu_{Q}(s)\leq\mu_{Q}(1-s).

This assumption states that values of s\in\big[0,\frac{1}{2}\big] are uniformly "better" (\mu is more negative) than their counterparts flipped about \frac{1}{2}. The opposite scenario (i.e., \mu(s)\geq\mu(1-s) for all s\in\big[0,\frac{1}{2}\big]) can be handled similarly, or can more simply be viewed as reducing to Assumption 1 upon swapping the two inputs. Up to such swapping, we were unable to find any binary-input binary-output channel for which Assumption 1 fails, and even in the multiple-output case, we found that it is difficult to find counter-examples. As two concrete examples, the BSC trivially satisfies \mu(s)=\mu(1-s), and the reverse Z-channel (i.e., the regular Z-channel with the inputs swapped) with parameter q\in(0,1) yields the increasing function \mu(s)=(1-s)\ln q. See Figure 4 for further illustrative numerical examples with binary output.

Lemma 5.

For any binary-input DMCs P and Q satisfying Assumption 1, we have

\min_{\gamma\in[0,1]}E_{1}(P,Q,\gamma)=-\min_{s\in[0,1]}\max\big(\mu_{P}(s),\mu_{Q}(s)\big), (69)

where E_{1}(P,Q,\gamma) is defined in the same way as E_{2}(P,Q,\gamma) above, but with no feedback being available in the communication setup.

Proof.

See Appendix -B. ∎

Observe that (69) precisely equals the learning rate derived in Theorem 2, and that we would match the converse in Theorem 3 if we could additionally show that E_{1}=E_{2}. Unfortunately, even the single round of feedback can strictly increase the error exponent (see Appendix -C for an example), meaning that the optimal learning rate remains unclear. On the positive side, it turns out that E_{1}=E_{2} whenever Q is a BSC, and in fact, in this case, we can establish that Theorem 2 gives the optimal learning rate even when Assumption 1 is dropped. Formally, we have the following.

Lemma 6.

For any binary-input DMC P, if Q is a BSC with crossover probability q\in\big(0,\frac{1}{2}\big), then the achievable learning rate in (45) is tight, i.e., \mathcal{R}^{*}(P,Q)=-\min_{s\in[0,1]}\max(\mu_{P}(s),\mu_{Q}(s)).

Proof.

See Appendix -D. ∎

Using Lemma 6 as a starting point, it is straightforward to establish that Theorem 3 can indeed strictly improve on the trivial converse in (67).

Corollary 2.

There exist binary-input DMCs PP and QQ such that the optimal learning rate is strictly smaller than max(mins[0,1]μP(s),mins[0,1]μQ(s))-\max\big{(}\min_{s\in[0,1]}\mu_{P}(s),\min_{s\in[0,1]}\mu_{Q}(s)\big{)}.

Proof.

This follows by letting PP be a (reverse) Z-channel and QQ be a BSC, with suitably-chosen parameters. See Appendix -E for the details. ∎

III-E A Converse for Block-Structured Protocols

While we have established several special cases where Theorem 2 gives the optimal learning rate, it remains open as to whether this is true in general. As a final result, we complement our general converse (Theorem 3) with a restricted converse that only applies to block-structured protocols. Specifically, we say that a teaching strategy is block structured with length k if it follows the structure in Figure 2: For any positive integer i, the bits transmitted at indices ki+1,\dotsc,k(i+1) are only allowed to depend on the bits received at indices k(i-1)+1,\dotsc,ki.

Theorem 4.

For any P and Q satisfying Assumption 1, and any protocol such that the teaching strategy is block structured with length k=o(n), the learning rate is at most -\min_{s\in[0,1]}\max(\mu_{P}(s),\mu_{Q}(s)).

Proof.

See Appendix -F. ∎

This result serves as evidence that Theorem 2 may provide the optimal learning rate under Assumption 1, or at least that to have any hope of improving on it, the teaching strategy should have some form of long-term memory.

IV Conclusion and Discussion

We have established optimal learning rates for the problem of teaching and learning a single bit under binary symmetric noise, using a simple and computationally efficient block-structured strategy. We have also adapted this technique to general binary-input DMCs.

As discussed above, while the optimal learning rate for general binary-input DMCs follows in several cases of interest, it remains unknown in general, leaving us with the following open problem.

Open Problem 1.

Does (46) give the optimal learning rate for arbitrary binary-input DMCs P and Q?

An important step towards answering this question would be removing Assumption 1 in the results that currently make use of it. Also regarding this assumption, we expect that it should hold true for all binary-input channels, and it would be of interest to establish whether this is indeed the case.

Another possible direction is to increase the input alphabet size of Q, removing the assumption of binary inputs. In the 1-hop case, having a channel with more inputs does not improve the learning rate beyond choosing the two best inputs for the channel [10]. By a similar idea, we can immediately deduce an achievability result via Theorem 2, but we are left with the following open problem.

Open Problem 2.

Does there exist a pair of DMCs P,QP,Q, such that every restriction of QQ to two inputs gives a strictly suboptimal learning rate?

In addition to these open problems, interesting directions may include the case that the channel transition laws are unknown to the teacher and/or student, and the case of teaching more than a single bit.

-A Proof of Lemma 3 (Technical Result for General Binary-Input DMCs)

We first state an auxiliary result from [10]. When the agent seeks to decide between the alternatives Θ=1\Theta=1 and Θ=0\Theta=0, the natural decision rule is to set a threshold on the corresponding log-likelihood ratio. The following result bounds the probability that this log-likelihood ratio test fails under a given threshold.

Lemma 7.

[10, Proof of Thm. 5] Let PP be a binary-input DMC, and for each output symbol yy, define

l(y)=lnP(y|1)P(y|0)l(y)=\ln\frac{P(y|1)}{P(y|0)} (70)

to be the log-likelihood ratio between the two conditional distributions. Let L1L_{1} and L0L_{0} follow the distribution of l(Y)l(Y) under YP(|1)Y\sim P(\cdot|1) and YP(|0)Y\sim P(\cdot|0) respectively. For all s(0,1)s\in(0,1), we have

(L0μP(s))exp[μP(s)sμP(s)],\mathbb{P}(L_{0}\geq\mu_{P}^{\prime}(s))\leq\exp[\mu_{P}(s)-s\mu_{P}^{\prime}(s)], (71)

and

(L1μP(s))exp[μP(s)+(1s)μP(s)].\mathbb{P}(L_{1}\leq\mu_{P}^{\prime}(s))\leq\exp[\mu_{P}(s)+(1-s)\mu_{P}^{\prime}(s)]. (72)

We now proceed with the proof of Lemma 3. First suppose that ll satisfies

lims0+μP(s)llims1μP(s).\lim_{s\rightarrow 0^{+}}\mu^{\prime}_{P}(s)\leq l\leq\lim_{s\rightarrow 1^{-}}\mu^{\prime}_{P}(s). (73)

Then, since μP(s)\mu^{\prime}_{P}(s) is continuous and non-decreasing [10], there exists t[0,1]t\in[0,1] such that l=μP(t)l=\mu_{P}^{\prime}(t), and by (71) and (72) (applied to the kk-fold product distribution), we obtain

ln(L0l)k(μP(t)tμP(t)),\ln\mathbb{P}(L_{0}\geq l)\leq k(\mu_{P}(t)-t\mu_{P}^{\prime}(t)), (74)

and

ln(L1l)k(μP(t)+(1t)μP(t)).\ln\mathbb{P}(L_{1}\leq l)\leq k(\mu_{P}(t)+(1-t)\mu_{P}^{\prime}(t)). (75)

On the other hand, if

l<lims0+μP(s),l<\lim_{s\rightarrow 0^{+}}\mu^{\prime}_{P}(s), (76)

then we may set t=0t=0, and (74) still holds due to the fact that μP(0)=0\mu_{P}(0)=0. In this case, we have

ln(L1l)ln(L1lims0+μP(s))(a)kμP(0),\ln\mathbb{P}(L_{1}\leq l)\leq\ln\mathbb{P}\Big{(}L_{1}\leq\lim_{s\rightarrow 0^{+}}\mu_{P}^{\prime}(s)\Big{)}\stackrel{{\scriptstyle(a)}}{{\leq}}k\mu_{P}^{\prime}(0), (77)

where (a) follows from fact that (75) holds when l=μP(t)l=\mu_{P}^{\prime}(t). It follows that (75) also holds here with t=0t=0. By a similar argument, if l>lims1μP(s)l>\lim_{s\rightarrow 1^{-}}\mu^{\prime}_{P}(s), then we may set t=1t=1, and (74) and (75) remain valid. Therefore, for every value of ll (possibly ±\pm\infty), there exists some t[0,1]t\in[0,1] such that equations (74) and (75) hold.

Since μP\mu_{P} is convex (see [10, Thm. 5]), μP(s¯)\mu_{P}(\bar{s}) is lower bounded by its tangent line approximation at tt. This implies

μP(t)+(s¯t)μP(t)μP(s¯)μmax,\mu_{P}(t)+(\bar{s}-t)\mu_{P}^{\prime}(t)\leq\mu_{P}(\bar{s})\leq\mu_{\max}, (78)

which re-arranges to give

(1s¯)(μP(t)tμP(t))μmaxs¯(μP(t)+(1t)μP(t)).(1-\bar{s})(\mu_{P}(t)-t\mu_{P}^{\prime}(t))\leq\mu_{\max}-\bar{s}(\mu_{P}(t)+(1-t)\mu_{P}^{\prime}(t)). (79)

Multiplying through by k>0k>0 and dividing through by μmax<0\mu_{\max}<0, we obtain

1s¯μmaxk(μP(t)tμP(t))ks¯μmaxk(μP(t)+(1t)μP(t)).\frac{1-\bar{s}}{\mu_{\max}}k(\mu_{P}(t)-t\mu_{P}^{\prime}(t))\geq k-\frac{\bar{s}}{\mu_{\max}}k(\mu_{P}(t)+(1-t)\mu_{P}^{\prime}(t)). (80)

Finally, further upper bounding the left-hand side via (74) and further lower bounding the right-hand side via (75), we deduce the desired inequality in (52).
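As a further optional sanity check on the intermediate steps (again an added illustration, assuming for concreteness that PP is a BSC so that the kk-letter log-likelihood ratio reduces to a scaled binomial), the sketch below computes the kk-fold tail probabilities exactly and confirms (74) and (75); the crossover probability, block length, and values of t are arbitrary.

```python
import math

p, k = 0.2, 20                        # arbitrary BSC crossover probability and block length
a = math.log((1 - p) / p)             # per-letter LLR values are +a (y=1) and -a (y=0)

def mu(s):
    # mu_P(s) for the BSC
    return math.log((1 - p) ** (1 - s) * p ** s + p ** (1 - s) * (1 - p) ** s)

def mu_prime(s, h=1e-6):
    return (mu(s + h) - mu(s - h)) / (2 * h)

def binom_range(n, q, lo, hi):
    # P(lo <= Bin(n, q) <= hi), computed exactly
    return sum(math.comb(n, m) * q ** m * (1 - q) ** (n - m)
               for m in range(max(lo, 0), min(hi, n) + 1))

for t in [0.2, 0.5, 0.8]:
    thr = k * mu_prime(t)             # threshold l = mu_P'(t) applied to the k-letter LLR
    # Under Theta = 0 the k-letter LLR equals (2M - k)a with M ~ Bin(k, p).
    P_L0_ge = binom_range(k, p, math.ceil((thr / a + k) / 2), k)
    # Under Theta = 1 we have M ~ Bin(k, 1 - p) and need the lower tail.
    P_L1_le = binom_range(k, 1 - p, 0, math.floor((thr / a + k) / 2))
    assert math.log(P_L0_ge) <= k * (mu(t) - t * mu_prime(t)) + 1e-6        # (74)
    assert math.log(P_L1_le) <= k * (mu(t) + (1 - t) * mu_prime(t)) + 1e-6  # (75)
    print(f"t={t}: bounds (74) and (75) verified")
```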

-B Proof of Lemma 5 (1-Hop Exponent Without Feedback)

Let x\vec{x} and x\vec{x}^{\prime} be the two codewords in the 1-hop communication problem without feedback, let y\vec{y} denote the received sequence, and define the nn-letter version of μ\mu as

μn,γ(s)=lny(Y=y|X=x)1s(Y=y|X=x)s.\mu_{n,\gamma}(s)=\ln\sum_{\vec{y}}\mathbb{P}(\vec{Y}=\vec{y}|\vec{X}=\vec{x}^{\prime})^{1-s}\mathbb{P}(\vec{Y}=\vec{y}\,|\,\vec{X}=\vec{x})^{s}. (81)

Using the classical analysis of Shannon, Gallager, and Berlekamp [10, Thm. 5] (see also Lemma 2 regarding the case P=QP=Q), the error exponent attained by optimal (maximum-likelihood) decoding is

limn1nmins[0,1]μn(s),\lim_{n\to\infty}-\frac{1}{n}\min_{s\in[0,1]}\mu_{n}(s), (82)

and accordingly, we proceed by analyzing μn(s)\mu_{n}(s). (The μn′′(s)\sqrt{\mu^{\prime\prime}_{n}(s)} term in [10, Thm. 5] scales as O(n)O(\sqrt{n}) due to the additive property of μ\mu for product distributions, and thus does not affect the error exponent.)

Without loss of optimality, we can assume that x\vec{x} and x\vec{x}^{\prime} differ in every entry. Let βP\beta_{P} be the fraction of uses of PP for which x\vec{x} takes value 11 (and hence x\vec{x}^{\prime} takes value 0), and define βQ\beta_{Q} analogously. Then, the additive property of μ\mu for product distributions (by taking the logarithm in (40)) gives

μn,γ(s)=βPγnμP(s)+(1βP)γnμP(1s)+βQ(1γ)nμQ(s)+(1βQ)(1γ)nμQ(1s),\mu_{n,\gamma}(s)=\beta_{P}\gamma n\mu_{P}(s)+(1-\beta_{P})\gamma n\mu_{P}(1-s)\\ +\beta_{Q}(1-\gamma)n\mu_{Q}(s)+(1-\beta_{Q})(1-\gamma)n\mu_{Q}(1-s), (83)

noting that by the definition of μP\mu_{P}, swapping the two inputs amounts to replacing ss by 1s1-s.

To further simplify (83), we define

μ=mins[0,1]max(μP(s),μQ(s)),\displaystyle\mu^{*}=\min_{s\in[0,1]}\,\max(\mu_{P}(s),\mu_{Q}(s)), (84)
s=argmins[0,1]max(μP(s),μQ(s)),\displaystyle s^{*}=\operatorname*{arg\,min}_{s\in[0,1]}\,\max(\mu_{P}(s),\mu_{Q}(s)), (85)

and introduce the shorthand

μ~n,γ(s)=γnμP(s)+(1γ)nμQ(s).\tilde{\mu}_{n,\gamma}(s)=\gamma n\mu_{P}(s)+(1-\gamma)n\mu_{Q}(s). (86)

We first provide a lemma lower bounding μ~n,γ(s)\tilde{\mu}_{n,\gamma}(s), and then show that μ~n,γ(s)\tilde{\mu}_{n,\gamma}(s) is itself a lower bound on μn,γ(s)\mu_{n,\gamma}(s).

Lemma 8.

There exists a choice of γ[0,1]\gamma\in[0,1] such that for all s[0,1]s\in[0,1], it holds that μ~n,γ(s)nμ\tilde{\mu}_{n,\gamma}(s)\geq n\mu^{*}.

Proof.

We split the proof into several cases. Throughout, it should be kept in mind that μP(s)\mu_{P}(s) is a continuous and convex function [10].

Case 1: If μP(s)μQ(s)\mu_{P}(s^{*})\geq\mu_{Q}(s^{*}) and μP\mu_{P} attains its global minimum at ss^{*} (here and subsequently, “global minimum” is meant with respect to the restricted domain s[0,1]s\in[0,1]), then we can set γ=1\gamma=1 to get

μ~n,γ(s)\displaystyle\tilde{\mu}_{n,\gamma}(s) =nμP(s)nμP(s)\displaystyle=n\mu_{P}(s)\geq n\mu_{P}(s^{*})
=nmax(μP(s),μQ(s))=nμ.\displaystyle=n\cdot\max(\mu_{P}(s^{*}),\mu_{Q}(s^{*}))=n\mu^{*}. (87)

A similar argument holds when μQ(s)μP(s)\mu_{Q}(s^{*})\geq\mu_{P}(s^{*}) and μQ\mu_{Q} attains its global minimum at ss^{*} (setting γ=0\gamma=0).

Case 2: If μP(s)>μQ(s)\mu_{P}(s^{*})>\mu_{Q}(s^{*}) and μP\mu_{P} does not attain its global minimum at ss^{*}, then we can shift ss^{*} slightly towards the minimizer of μP\mu_{P} to decrease max(μP(s),μQ(s))\max(\mu_{P}(s),\mu_{Q}(s)), contradicting the definition of ss^{*}. Hence, this case does not occur. A similar finding applies to the case that μQ(s)>μP(s)\mu_{Q}(s^{*})>\mu_{P}(s^{*}) and μQ\mu_{Q} does not attain its global minimum at ss^{*}.

Case 3: We claim that the above cases only leave the case that μP(s)=μQ(s)\mu_{P}(s^{*})=\mu_{Q}(s^{*}) and neither μP\mu_{P} nor μQ\mu_{Q} attains its global minimum at ss^{*}. To see this, note that if μP(s)μQ(s)\mu_{P}(s^{*})\neq\mu_{Q}(s^{*}), then the larger one attaining its global minimum would give Case 1, whereas the larger one not attaining its global minimum would give Case 2. Hence, we may assume that μP(s)=μQ(s)\mu_{P}(s^{*})=\mu_{Q}(s^{*}). Then, we can additionally assume that neither μP\mu_{P} nor μQ\mu_{Q} attains its global minimum at ss^{*}, since otherwise we would again be in Case 1. We proceed with two further sub-cases.

Case 3a: If μP(s)μQ(s)>0\mu^{\prime}_{P}(s^{*})\mu^{\prime}_{Q}(s^{*})>0, then we can shift ss^{*} by a small amount and decrease max(μP(s),μQ(s))\max(\mu_{P}(s),\mu_{Q}(s)), contradicting the fact that ss^{*} is the minimizer (unless we are at an endpoint of the interval [0,1][0,1], in which case one of μP\mu_{P} and μQ\mu_{Q} must attain its global minimum there, again a contradiction). Hence, this sub-case does not occur.

Case 3b: If μP(s)μQ(s)0\mu^{\prime}_{P}(s^{*})\mu^{\prime}_{Q}(s^{*})\leq 0, then we choose

γ=μQ(s)μQ(s)μP(s),1γ=μP(s)μP(s)μQ(s).\gamma=\frac{\mu^{\prime}_{Q}(s^{*})}{\mu^{\prime}_{Q}(s^{*})-\mu^{\prime}_{P}(s^{*})},\quad 1-\gamma=\frac{\mu^{\prime}_{P}(s^{*})}{\mu^{\prime}_{P}(s^{*})-\mu^{\prime}_{Q}(s^{*})}. (88)

The assumption μP(s)μQ(s)0\mu^{\prime}_{P}(s^{*})\mu^{\prime}_{Q}(s^{*})\leq 0 ensures that these quantities both lie in [0,1][0,1]; note also that the denominators in (88) are non-zero, since under this sign assumption μP(s)=μQ(s)\mu^{\prime}_{P}(s^{*})=\mu^{\prime}_{Q}(s^{*}) would force both derivatives to be zero, in which case both μP\mu_{P} and μQ\mu_{Q} would attain their global minimum at ss^{*}, contradicting the Case 3 assumption. Then, taking the derivative in (86), we obtain

μ~n,γ(s)=(γnμP(s)+(1γ)nμQ(s))|s=s=0.\tilde{\mu}^{\prime}_{n,\gamma}(s)=(\gamma n\mu^{\prime}_{P}(s)+(1-\gamma)n\mu^{\prime}_{Q}(s))|_{s=s^{*}}=0. (89)

Since μ~n,γ\tilde{\mu}_{n,\gamma} is convex (following from μP\mu_{P} and μQ\mu_{Q} being convex), this implies that μ~n,γ\tilde{\mu}_{n,\gamma} must attain its global minimum at s=ss=s^{*}. In addition, since μP(s)=μQ(s)\mu_{P}(s^{*})=\mu_{Q}(s^{*}), their common value must equal μ\mu^{*}, which further implies μ~n,γ(s)nμ\tilde{\mu}_{n,\gamma}(s)\geq n\cdot\mu^{*}. ∎
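To illustrate Lemma 8 concretely, the following sketch (an added numerical illustration; the two channels are arbitrary binary-input examples with no zero transition probabilities) computes s* and μ* on a grid, and checks that at least one of the candidate values of γ — the endpoints from Case 1 and the formula (88) from Case 3b — satisfies γμ_P(s) + (1−γ)μ_Q(s) ≥ μ* for all s on the grid (the factor n cancels).

```python
import numpy as np

def mu(C, s):
    # mu_C(s) = ln sum_y C(y|0)^{1-s} C(y|1)^s  (rows of C: C(.|0) and C(.|1))
    return np.log(np.sum(C[0] ** (1 - s) * C[1] ** s))

def mu_prime(C, s, h=1e-6):
    return (mu(C, s + h) - mu(C, s - h)) / (2 * h)

# Arbitrary example channels with no zero entries:
P = np.array([[0.8, 0.2], [0.2, 0.8]])   # BSC(0.2)
Q = np.array([[0.9, 0.1], [0.3, 0.7]])   # a binary asymmetric channel

grid = np.linspace(0.0, 1.0, 1001)
vals = np.array([max(mu(P, s), mu(Q, s)) for s in grid])
s_star, mu_star = grid[np.argmin(vals)], vals.min()   # (85) and (84) on the grid

# Candidate gammas: the endpoints (Case 1) and the Case 3b formula (88).
dP, dQ = mu_prime(P, s_star), mu_prime(Q, s_star)
candidates = [0.0, 1.0]
if dP != dQ:
    candidates.append(min(max(dQ / (dQ - dP), 0.0), 1.0))

def works(gamma):
    # Check gamma*mu_P(s) + (1-gamma)*mu_Q(s) >= mu* over the whole grid.
    return all(gamma * mu(P, s) + (1 - gamma) * mu(Q, s) >= mu_star - 1e-9 for s in grid)

valid = [g for g in candidates if works(g)]
print("s* =", s_star, " mu* =", mu_star, " valid gamma:", valid)
assert valid, "Lemma 8 guarantees at least one candidate succeeds"
```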

We now relate μn,γ(s)\mu_{n,\gamma}(s) to μ~n,γ(s)\tilde{\mu}_{n,\gamma}(s). Recall from Assumption 1 that μP(s)μP(1s)\mu_{P}(s)\leq\mu_{P}(1-s) and μQ(s)μQ(1s)\mu_{Q}(s)\leq\mu_{Q}(1-s) for s[0,12]s\in[0,\frac{1}{2}]. Accordingly, if s[0,12]s\in[0,\frac{1}{2}], then substituting these into (83) gives

μn,γ(s)μ~n,γ(s).\mu_{n,\gamma}(s)\geq\tilde{\mu}_{n,\gamma}(s). (90)

On the other hand, if s[12,1]s\in[\frac{1}{2},1], then Assumption 1 gives μP(1s)μP(s)\mu_{P}(1-s)\leq\mu_{P}(s) and μQ(1s)μQ(s)\mu_{Q}(1-s)\leq\mu_{Q}(s), and substitution into (83) gives

μn,γ(s)μ~n,γ(1s).\mu_{n,\gamma}(s)\geq\tilde{\mu}_{n,\gamma}(1-s). (91)

Combining these two cases, we have for all s[0,1]s\in[0,1] that μn,γ(s)μ~n,γ(s)\mu_{n,\gamma}(s)\geq\tilde{\mu}_{n,\gamma}(s^{\prime}) for one of the two choices s{s,1s}s^{\prime}\in\{s,1-s\}, and hence, Lemma 8 implies there exists γ[0,1]\gamma\in[0,1] such that

mins[0,1]μn,γ(s)nμ.\min_{s\in[0,1]}\mu_{n,\gamma}(s)\geq n\mu^{*}. (92)

In view of the fact that (82) is the optimal error exponent, we have shown that there always exists γ[0,1]\gamma\in[0,1] such that the error exponent is upper bounded by μ-\mu^{*}.

To complete the proof, we also need to show that for all γ[0,1]\gamma\in[0,1], there exist choices of the codewords and s[0,1]s\in[0,1] such that μn,γ(s)nμ\mu_{n,\gamma}(s)\leq n\mu^{*}. This is much simpler, following directly by setting βP=βQ=1\beta_{P}=\beta_{Q}=1 (i.e., choosing the all-zeros and all-ones codewords) in (83), setting s=ss=s^{*}, and upper bounding both μP(s)\mu_{P}(s^{*}) and μQ(s)\mu_{Q}(s^{*}) by μ\mu^{*}.

-C A Case Where Feedback Increases the 1-Hop Exponent

Let PP be a BSC with crossover probability p(0,12)p\in\big{(}0,\frac{1}{2}\big{)}, and let QQ be a reverse Z-channel (RZ-channel for short) with Q(1|1)=1Q(1|1)=1 and Q(1|0)=q(0,1)Q(1|0)=q\in(0,1). Note that Assumption 1 holds for these channels, as discussed just above Lemma 5. Consider the following communication strategy with partial feedback:

  • Repeatedly input Θ\Theta to the BSC in the first γn\gamma n channel uses.

  • After observing the feedback, if more than half the BSC bits are flipped, send the all-0s sequence over the RZ-channel in (1γ)n(1-\gamma)n uses, and otherwise send the all-11s sequence.

At the decoder, an initial estimate Θ~\tilde{\Theta} is formed based on maximum-likelihood decoding (i.e., a majority vote) on the γn\gamma n received BSC symbols. Then, after receiving the RZ-channel output, the decision is reversed (i.e., Θ^=1Θ~\hat{\Theta}=1-\tilde{\Theta}) if any 0s are observed among the (1γ)n(1-\gamma)n received symbols; otherwise, the decision is kept (i.e., Θ^=Θ~\hat{\Theta}=\tilde{\Theta}). Essentially, the RZ-channel uses serve to tell the decoder whether or not the initial decision should be changed.

If the initial decision Θ~\tilde{\Theta} is correct, then the final decision is correct with conditional probability one, since the all-11s sequence is received noiselessly. On the other hand, if the initial decision is incorrect, then the probability that it fails to be reversed is q(1γ)nq^{(1-\gamma)n}. Hence, an error only occurs if both steps fail, and the overall error exponent EE of this strategy is given by

E=γEBSC+(1γ)ln1q,E=\gamma E_{\rm BSC}+(1-\gamma)\ln\frac{1}{q}, (93)

where EBSC=D(12p)=12ln(14p(1p))E_{\rm BSC}=D\big{(}\frac{1}{2}\,\big{\|}p\big{)}=\frac{1}{2}\ln\big{(}\frac{1}{4p(1-p)}\big{)} is the optimal error exponent for transmitting two messages over a BSC (see (5)).

In Figure 5, we compare the exponent in (93) to that without feedback (Lemma 5), setting p=0.2p=0.2 and q=0.8q=0.8, which turns out to make the optimal exponents of PP and QQ identical. In this example, we see that feedback strictly increases the exponent for all γ(0,1)\gamma\in(0,1). Intuitively, the improvement comes from being able to “reverse” initial “bad events” to turn them into good events, which is not possible in the absence of feedback.

Figure 5: Comparison of the error exponent under the two-round coding scheme with feedback vs. the error exponent without feedback, when PP is a BSC with parameter p=0.2p=0.2, and QQ is a reverse Z-channel with parameter q=0.8q=0.8.
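The comparison behind Figure 5 can be reproduced numerically. The following is an illustrative sketch (not part of the analysis above): it evaluates the feedback exponent (93), and, for the no-feedback curve, uses the exponent in the form −min_s[γμ_P(s)+(1−γ)μ_Q(s)] suggested by (86) and the surrounding proof; this reading of Lemma 5, along with the grid resolution, is an assumption on our part.

```python
import numpy as np

def mu(C, s):
    # mu_C(s) = ln sum_y C(y|0)^{1-s} C(y|1)^s  (rows of C: C(.|0) and C(.|1))
    return np.log(np.sum(C[0] ** (1 - s) * C[1] ** s))

p, q = 0.2, 0.8
P = np.array([[1 - p, p], [p, 1 - p]])   # BSC(p)
Q = np.array([[1 - q, q], [0.0, 1.0]])   # reverse Z-channel: Q(1|1) = 1, Q(1|0) = q

E_bsc = 0.5 * np.log(1.0 / (4 * p * (1 - p)))   # D(1/2 || p), approx 0.2231
E_rz = np.log(1.0 / q)                          # exponent of the RZ-channel alone, also approx 0.2231

s_grid = np.linspace(0, 1, 2001)
for gamma in np.linspace(0, 1, 11):
    E_fb = gamma * E_bsc + (1 - gamma) * E_rz   # feedback scheme, Eq. (93)
    E_nofb = -min(gamma * mu(P, s) + (1 - gamma) * mu(Q, s) for s in s_grid)
    print(f"gamma={gamma:.1f}: with feedback {E_fb:.4f}, without feedback {E_nofb:.4f}")
```

For p=0.2 and q=0.8 the two endpoint exponents coincide (both equal ln(1/0.8) ≈ 0.2231), so the feedback exponent is constant in γ, while the no-feedback values dip below it for every interior γ, consistent with the discussion above.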

-D Proof of Lemma 6 (Tightness for Symmetric QQ)

Throughout the proof, we let (y|0)\mathbb{P}(\vec{y}|0) be a shorthand for (Y=y|Θ=0)\mathbb{P}(\vec{Y}=\vec{y}|\Theta=0), and similarly for (y|1)\mathbb{P}(\vec{y}|1) and other analogous quantities.

We make use of Theorem 3, and accordingly consider a (feedback) communication scenario with γn\gamma n uses of PP and (1γ)n(1-\gamma)n uses of QQ. Since we assume Θ\Theta to be equiprobable on {0,1}\{0,1\}, the optimal decoding rule given a received sequence y\vec{y} is to set Θ^=1\hat{\Theta}=1 if (Y=y|Θ=1)>(Y=y|Θ=0)\mathbb{P}(\vec{Y}=\vec{y}|\Theta=1)>\mathbb{P}(\vec{Y}=\vec{y}|\Theta=0), and Θ^=0\hat{\Theta}=0 otherwise. Under this optimal rule, the error probability Pe=(Θ^Θ)P_{\rm e}=\mathbb{P}(\hat{\Theta}\neq\Theta) is given by [10, Eq. (3.2)]

Pe=12ymin((y|0),(y|1)).\displaystyle P_{\rm e}=\frac{1}{2}\sum_{\vec{y}}\min(\mathbb{P}(\vec{y}|0),\mathbb{P}(\vec{y}|1)). (94)

We momentarily consider the case that there is no feedback, in which case we can simply define x\vec{x} and x\vec{x}^{\prime} to be the two length-nn codewords (one per message), and choose these codewords to minimize the error probability, yielding

Peno-fb=12minx,xymin((y|x),(y|x)).\displaystyle P_{\rm e}^{\text{no-fb}}=\frac{1}{2}\min_{\vec{x},\vec{x}^{\prime}}\sum_{\vec{y}}\min(\mathbb{P}(\vec{y}|\vec{x}),\mathbb{P}(\vec{y}|\vec{x}^{\prime})). (95)

Since this expression is still exactly equal to the smallest possible error probability, its corresponding exponent is E1(P,Q,γ)E_{1}(P,Q,\gamma), defined analogously to E2(P,Q,γ)E_{2}(P,Q,\gamma) but without feedback.

In the two-round setting with feedback considered in Theorem 3, slightly more care is needed. We proceed by explicitly splitting up the codeword and received sequence as x=(xP,xQ)\vec{x}=(\vec{x}_{P},\vec{x}_{Q}) and y=(yP,yQ)\vec{y}=(\vec{y}_{P},\vec{y}_{Q}), and writing the following for θ{0,1}\theta\in\{0,1\}:

(y|θ)=Pγn(yP|xP(θ))Q(1γ)n(yQ|xQ(yP,θ)),\mathbb{P}(\vec{y}|\theta)=P^{\gamma n}(\vec{y}_{P}|\vec{x}_{P}(\theta))Q^{(1-\gamma)n}(\vec{y}_{Q}|\vec{x}_{Q}(\vec{y}_{P},\theta)), (96)

where xQ(yP,θ)\vec{x}_{Q}(\vec{y}_{P},\theta) explicitly writes xQ\vec{x}_{Q} as a function of yP\vec{y}_{P} and θ\theta, and similarly for xP=xP(θ)\vec{x}_{P}=\vec{x}_{P}(\theta). Substituting into (94), we obtain

Pe=12yP,yQmin(Pγn(yP|xP(0))Q(1γ)n(yQ|xQ(yP,0)),\displaystyle P_{\rm e}=\frac{1}{2}\sum_{\vec{y}_{P},\vec{y}_{Q}}\min\big{(}P^{\gamma n}(\vec{y}_{P}|\vec{x}_{P}(0))Q^{(1-\gamma)n}(\vec{y}_{Q}|\vec{x}_{Q}(\vec{y}_{P},0)),
Pγn(yP|xP(1))Q(1γ)n(yQ|xQ(yP,1)))\displaystyle\qquad\qquad P^{\gamma n}(\vec{y}_{P}|\vec{x}_{P}(1))Q^{(1-\gamma)n}(\vec{y}_{Q}|\vec{x}_{Q}(\vec{y}_{P},1))\big{)} (97)
12minxP,xPyPminxQ,xQyQmin(Pγn(yP|xP)Q(1γ)n(yQ|xQ),\displaystyle\geq\frac{1}{2}\min_{\vec{x}_{P},\vec{x}^{\prime}_{P}}\sum_{\vec{y}_{P}}\min_{\vec{x}_{Q},\vec{x}^{\prime}_{Q}}\sum_{\vec{y}_{Q}}\min\big{(}P^{\gamma n}(\vec{y}_{P}|\vec{x}_{P})Q^{(1-\gamma)n}(\vec{y}_{Q}|\vec{x}_{Q}),
Pγn(yP|xP)Q(1γ)n(yQ|xQ)).\displaystyle\qquad\qquad P^{\gamma n}(\vec{y}_{P}|\vec{x}^{\prime}_{P})Q^{(1-\gamma)n}(\vec{y}_{Q}|\vec{x}^{\prime}_{Q})\big{)}. (98)

We now observe that this expression coincides with the non-feedback case given in (95), except that the minimum over (xQ,xQ)(\vec{x}_{Q},\vec{x}^{\prime}_{Q}) comes after the summation over yP\vec{y}_{P} (since the former is allowed to depend on the latter).

While this ordering can be significant in general, we now show that it has no effect when QQ is a BSC. Without loss of optimality, we can assume that xQ\vec{x}_{Q} and xQ\vec{x}^{\prime}_{Q} differ in every entry, since any entries that coincide would be independent of Θ\Theta (at least conditionally given yP\vec{y}_{P}) and reveal no information (or viewed differently, could be simulated at the decoder). However, when QQ is a BSC, its symmetry implies that any quantity of the form

yQmin(aQ(1γ)n(yQ|xQ),bQ(1γ)n(yQ|xQ))\sum_{\vec{y}_{Q}}\min(aQ^{(1-\gamma)n}(\vec{y}_{Q}|\vec{x}_{Q}),bQ^{(1-\gamma)n}(\vec{y}_{Q}|\vec{x}^{\prime}_{Q})) (99)

(with constants aa and bb) is identical for all such (xQ,xQ)(\vec{x}_{Q},\vec{x}^{\prime}_{Q}) pairs; the precise choice only amounts to re-ordering terms in the summation over yQ\vec{y}_{Q}. It follows that interchanging the order of yP\sum_{\vec{y}_{P}} and minxQ,xQ\min_{\vec{x}_{Q},\vec{x}^{\prime}_{Q}} in (98) has no impact, and we recover the non-feedback expression (95), whose error exponent is defined to be E1(P,Q,γ)E_{1}(P,Q,\gamma).

It only remains to show that the error exponent E1(P,Q,γ)E_{1}(P,Q,\gamma) is no higher than the learning rate in (45). To show this, we re-use some of the findings from Appendix -B (note that Assumption 1 was not applied until after Lemma 8 in Appendix -B, and we will not use it here). In particular, the error exponent is still given by (82), and μn,γ(s)\mu_{n,\gamma}(s) still satisfies (83), assuming without loss of optimality that the two codewords x\vec{x} and x\vec{x}^{\prime} differ in every entry.

For any fixed ss, we can make μn,γ(s)\mu_{n,\gamma}(s) in (83) no larger by choosing both βP\beta_{P} and βQ\beta_{Q} to be either zero or one (e.g., if μP(s)<μP(1s)\mu_{P}(s)<\mu_{P}(1-s) then we set βP=1\beta_{P}=1). That is, without loss of optimality, every input to PP is identical, and every input to QQ is identical. Assuming without loss of generality that βP=1\beta_{P}=1, and noting that the symmetry of the BSC implies μQ(s)=μQ(1s)\mu_{Q}(s)=\mu_{Q}(1-s), it follows that we can simplify (83) to

μn,γ(s)=γnμP(s)+(1γ)nμQ(s).\mu_{n,\gamma}(s)=\gamma n\mu_{P}(s)+(1-\gamma)n\mu_{Q}(s). (100)

This precisely coincides with the quantity μ~n,γ(s)\tilde{\mu}_{n,\gamma}(s) introduced in (86), and hence, we can directly apply Lemma 8 to obtain μn,γ(s)nμ\mu_{n,\gamma}(s)\geq n\mu^{*} for some γ[0,1]\gamma\in[0,1] and all s[0,1]s\in[0,1], where μ\mu^{*} is defined in (84). Since μ\mu^{*} coincides with (45), this completes the proof.

-E Proof of Corollary 2 (Strict Improvement in the Converse)

To see that the converse in Theorem 3 can strictly improve on the trivial one in (67), we re-use the example shown in Figure 5 in Appendix -C, but with PP and QQ interchanged (i.e., PP is a reverse Z-channel with parameter 0.80.8, and QQ is a BSC with parameter 0.20.2). In this case, it is straightforward to verify that the “No Feedback” curve in Figure 5 is simply flipped with respect to the mid-point γ=12\gamma=\frac{1}{2}, and in particular, all values of γ(0,1)\gamma\in(0,1) still give a strictly smaller exponent than the endpoints γ=0\gamma=0 and γ=1\gamma=1.

The key difference due to interchanging PP and QQ is that in view of the above proof of Lemma 6, the feedback no longer helps, and hence, the same curve just discussed serves as a valid converse even with the single round of feedback. Hence, by Theorem 3 and the fact that the simple converse in (67) corresponds to taking the better of γ=0\gamma=0 and γ=1\gamma=1, we conclude that a strict improvement has indeed been shown.

While the reliance of the preceding proof on a numerical evaluation can be circumvented, we omit such an approach, since we believe that this example has no ambiguity with respect to potential numerical issues.
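For completeness, the numerical evaluation referred to above can be reproduced with the following sketch (an added illustration, again adopting the no-feedback exponent form −min_s[γμ_P(s)+(1−γ)μ_Q(s)] used in the preceding appendices, which is our reading of Lemma 5); it checks on a grid that every interior γ gives a strictly smaller exponent than the better of the two endpoints.

```python
import numpy as np

def mu(C, s):
    # mu_C(s) = ln sum_y C(y|0)^{1-s} C(y|1)^s  (rows of C: C(.|0) and C(.|1))
    return np.log(np.sum(C[0] ** (1 - s) * C[1] ** s))

# P and Q interchanged relative to Appendix -C:
P = np.array([[0.2, 0.8], [0.0, 1.0]])   # reverse Z-channel with parameter 0.8
Q = np.array([[0.8, 0.2], [0.2, 0.8]])   # BSC with parameter 0.2

s_grid = np.linspace(0, 1, 2001)

def E1(gamma):
    # No-feedback exponent in the form used in the proofs above.
    return -min(gamma * mu(P, s) + (1 - gamma) * mu(Q, s) for s in s_grid)

endpoints = min(E1(0.0), E1(1.0))                      # better of gamma in {0,1}, cf. (67)
interior = [E1(g) for g in np.linspace(0.05, 0.95, 19)]
print(f"endpoints: {endpoints:.4f}, largest interior value: {max(interior):.4f}")
assert max(interior) < endpoints - 1e-3                # strict improvement on the grid
```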

-F Proof of Theorem 4 (Converse for General Block-Structured Protocols)

We first consider a single block. Letting YY denote the kk symbols received by the teacher, X^\hat{X} denote the kk transmitted bits, and WW denote the block received by the student (here we omit the vector notation ()\vec{(\cdot)} to lighten the exposition), we have the Markov chain ΘYX^W\Theta\rightarrow Y\rightarrow\hat{X}\rightarrow W.

We begin with an inequality on mixture distributions (cf. Definition 1). Let AA and BB be probability distributions, and suppose AA follows a mixture distribution described by [p1,A1;;pm,Am][p_{1},A_{1};\ldots;p_{m},A_{m}]. Then, observe that

ρ(A,B,s)\displaystyle\rho(A,B,s) =x(i=1mpi(Ai=x))1s(B=x)s\displaystyle=\sum_{x}\bigg{(}\sum_{i=1}^{m}p_{i}\mathbb{P}(A_{i}=x)\bigg{)}^{1-s}\mathbb{P}(B=x)^{s} (101)
i=1mpix(Ai=x)1s(B=x)s\displaystyle\geq\sum_{i=1}^{m}p_{i}\sum_{x}\mathbb{P}(A_{i}=x)^{1-s}\mathbb{P}(B=x)^{s} (102)
mini=1,,mρ(Ai,B,s),\displaystyle\geq\min_{i=1,\dotsc,m}\rho(A_{i},B,s), (103)

where (102) applies Jensen’s inequality to the concave function f(z)=z1sf(z)=z^{1-s} (s[0,1]s\in[0,1]), and (103) lower bounds the average by the minimum. The same argument applies when AA is fixed and BB is a mixture.
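Since the chain (101)–(103) is used repeatedly, we note that it is straightforward to sanity-check numerically; the sketch below (an added illustration, with randomly generated distributions and an arbitrary alphabet size) verifies ρ(A,B,s) ≥ min_i ρ(A_i,B,s) for random mixtures.

```python
import numpy as np

rng = np.random.default_rng(0)

def rho(A, B, s):
    # rho(A, B, s) = sum_x A(x)^{1-s} B(x)^s
    return np.sum(A ** (1 - s) * B ** s)

def random_dist(size):
    v = rng.random(size)
    return v / v.sum()

for _ in range(1000):
    m, alphabet, s = 4, 6, rng.random()
    components = [random_dist(alphabet) for _ in range(m)]     # A_1, ..., A_m
    weights = random_dist(m)                                   # p_1, ..., p_m
    A = sum(w * c for w, c in zip(weights, components))        # the mixture A
    B = random_dist(alphabet)
    assert rho(A, B, s) >= min(rho(c, B, s) for c in components) - 1e-12
print("mixture inequality (103) verified on random instances")
```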

For convenience, define μ(A,B,s)=lnρ(A,B,s)\mu(A,B,s)=\ln\rho(A,B,s). Considering two length-kk sequences x,x\vec{x},\vec{x}^{\prime}, by the additive property of μ\mu (i.e., multiplicative property of ρ\rho) for product distributions, the corresponding conditional distributions for WW satisfy

μ(PW|X^=x,PW|X^=x,s)k1μQ(s)+k2μQ(1s),\mu(P_{W|\hat{X}=\vec{x}},P_{W|\hat{X}=\vec{x}^{\prime}},s)\geq k_{1}\mu_{Q}(s)+k_{2}\mu_{Q}(1-s), (104)

where k1k_{1} is the number of indices where x\vec{x} is zero and x\vec{x}^{\prime} is one, and vice versa for k2k_{2}.

Supposing first that s[0,12]s\in[0,\frac{1}{2}], we can use the assumption μQ(s)μQ(1s)\mu_{Q}(s)\leq\mu_{Q}(1-s) from Assumption 1, and weaken (104) to

μ(PW|X^=x,PW|X^=x,s)kμQ(s)\mu(P_{W|\hat{X}=\vec{x}},P_{W|\hat{X}=\vec{x}^{\prime}},s)\geq k\mu_{Q}(s) (105)

since k1+k2kk_{1}+k_{2}\leq k and μQ(s)0\mu_{Q}(s)\leq 0. Observe that (105) gives a lower bound that holds uniformly with respect to x,x\vec{x},\vec{x}^{\prime}. Then, using the property (103) and the fact that PW|Θ=0P_{W|\Theta=0} and PW|Θ=1P_{W|\Theta=1} are mixture distributions mixed by X^\hat{X}, we see that (105) implies (for s[0,12]s\in[0,\frac{1}{2}]) that

μ(PW|Θ=0,PW|Θ=1,s)kμQ(s).\mu(P_{W|\Theta=0},P_{W|\Theta=1},s)\geq k\mu_{Q}(s). (106)

This gives us a QQ-dependent lower bound, and the PP-dependent lower bound turns out to be simpler: Using the Markov chain ΘYX^W\Theta\rightarrow Y\rightarrow\hat{X}\rightarrow W, along with the data processing inequality for μ(,,s)\mu(\cdot,\cdot,s) (this is equivalent to the data processing inequality for Rényi divergence, defined as Dα(PQ)=1α1lnxP(x)αQ(x)1αD_{\alpha}(P\|Q)=\frac{1}{\alpha-1}\ln\sum_{x}P(x)^{\alpha}Q(x)^{1-\alpha}; see for example [12, Example 2]), we have

μ(PW|Θ=0,PW|Θ=1,s)μ(PY|Θ=0,PY|Θ=1,s)=kμP(s).\mu(P_{W|\Theta=0},P_{W|\Theta=1},s)\geq\mu(P_{Y|\Theta=0},P_{Y|\Theta=1},s)=k\mu_{P}(s). (107)

Combining the lower bounds, we have for all s[0,12]s\in[0,\frac{1}{2}] that

μ(PW|Θ=0,PW|Θ=1,s)kmax(μP(s),μQ(s)).\mu(P_{W|\Theta=0},P_{W|\Theta=1},s)\geq k\max(\mu_{P}(s),\mu_{Q}(s)). (108)

If s[12,1]s\in[\frac{1}{2},1], then Assumption 1 gives μQ(s)μQ(1s)\mu_{Q}(s)\geq\mu_{Q}(1-s), and the analog of (106) is

μ(PW|Θ=0,PW|Θ=1,s)kμQ(1s).\mu(P_{W|\Theta=0},P_{W|\Theta=1},s)\geq k\mu_{Q}(1-s). (109)

Since (107) holds for all s[0,1]s\in[0,1], and Assumption 1 similarly gives μP(s)μP(1s)\mu_{P}(s)\geq\mu_{P}(1-s) for s[12,1]s\in[\frac{1}{2},1], it follows that for any s[12,1]s\in[\frac{1}{2},1], we have

μ(PW|Θ=0,PW|Θ=1,s)kmax(μP(1s),μQ(1s)).\mu(P_{W|\Theta=0},P_{W|\Theta=1},s)\geq k\max(\mu_{P}(1-s),\mu_{Q}(1-s)). (110)

Combining the cases s[0,12]s\in[0,\frac{1}{2}] and s[12,1]s\in[\frac{1}{2},1], we deduce that

mins[0,1]μ(PW|Θ=0,PW|Θ=1,s)mins[0,12]kmax(μP(s),μQ(s)).\min_{s\in[0,1]}\mu(P_{W|\Theta=0},P_{W|\Theta=1},s)\geq\min_{s\in[0,\frac{1}{2}]}k\max(\mu_{P}(s),\mu_{Q}(s)). (111)

Having studied a single block, we are now in a position to study the overall combination of nk1\frac{n}{k}-1 blocks (the first block received by the student does not depend on Θ\Theta, and provides no distinguishing power), defining the nn-letter version of μ\mu as

μn(s)=lnz(Z=z|Θ=0)1s(Z=z|Θ=1)s,\mu_{n}(s)=\ln\sum_{\vec{z}}\mathbb{P}(\vec{Z}=\vec{z}\,|\,\Theta=0)^{1-s}\mathbb{P}(\vec{Z}=\vec{z}\,|\,\Theta=1)^{s}, (112)

where Z\vec{Z} is the entire length-nn sequence received by the student. Assuming momentarily that the teacher employs an identical strategy within each block, the additive property of μ\mu for product distributions (arising from PP and QQ being memoryless) gives

μn(s)=(nk1)μ(PW|Θ=0,PW|Θ=1,s),\mu_{n}(s)=\left(\frac{n}{k}-1\right)\mu(P_{W|\Theta=0},P_{W|\Theta=1},s), (113)

and substituting (111) gives

mins[0,1]μn(s)mins[0,12](nk)max(μP(s),μQ(s)).\min_{s\in[0,1]}\mu_{n}(s)\geq\min_{s\in[0,\frac{1}{2}]}(n-k)\max(\mu_{P}(s),\mu_{Q}(s)). (114)

In fact, this lower bound holds even if the teacher applies a different strategy between blocks, since we showed that (111) holds regardless of the strategy used within the block.

Letting s=argmins[0,1]μn(s)s^{*}=\operatorname*{arg\,min}_{s\in[0,1]}\mu_{n}(s), we have from [10, Thm. 5] (a more general form of Lemma 2) that, for any decoding rule employed by the student, the error probability is lower bounded by

(Θ^nΘ)18exp(μn(s)2μn′′(s)),\mathbb{P}(\hat{\Theta}_{n}\neq\Theta)\geq\frac{1}{8}\exp(\mu_{n}(s^{*})-\sqrt{2\mu^{\prime\prime}_{n}(s^{*})}), (115)

where μn′′(s)\mu^{\prime\prime}_{n}(s) denotes the second derivative when s(0,1)s\in(0,1), and the appropriate limiting value when s{0,1}s\in\{0,1\}. Again using the additive property of μ\mu, the following holds if the teacher employs the same per-block strategy in each block:

μn′′(s)=(nk1)μ′′(PW|Θ=0,PW|Θ=1,s).\mu_{n}^{\prime\prime}(s)=\Big{(}\frac{n}{k}-1\Big{)}\mu^{\prime\prime}(P_{W|\Theta=0},P_{W|\Theta=1},s). (116)

Once again, our analysis also extends immediately to the case of varying per-block strategies.

In Lemma 9 below, we show that μ′′(PW|Θ=0,PW|Θ=1,s)=O(k2)\mu^{\prime\prime}(P_{W|\Theta=0},P_{W|\Theta=1},s)=O(k^{2}), and substitution into (116) gives μn′′(s)=O(nk)\mu^{\prime\prime}_{n}(s)=O(nk). The assumption k=o(n)k=o(n) then yields μn′′(s)=o(n2)\mu^{\prime\prime}_{n}(s)=o(n^{2}), or μn′′(s)=o(n)\sqrt{\mu^{\prime\prime}_{n}(s)}=o(n). Substituting this scaling into (115), applying (114), and taking suitable limits, we obtain

lim supn1nln(Θ^nΘ)mins[0,12]max(μP(s),μQ(s)).\limsup_{n\rightarrow\infty}\,-\frac{1}{n}\ln\mathbb{P}(\hat{\Theta}_{n}\neq\Theta)\leq-\min_{s\in[0,\frac{1}{2}]}\max(\mu_{P}(s),\mu_{Q}(s)). (117)

We deduce Theorem 4 by further upper bounding the right-hand side by expanding the minimum from [0,12][0,\frac{1}{2}] to [0,1][0,1]; (117) additionally reveals that further restricting s[0,12]s\in[0,\frac{1}{2}] is without loss of optimality, which is unsurprising given Assumption 1.

It only remains to show the following.

Lemma 9.

For any fixed PP and QQ, and any s[0,1]s\in[0,1], it holds that μ′′(PW|Θ=0,PW|Θ=1,s)=O(k2)\mu^{\prime\prime}(P_{W|\Theta=0},P_{W|\Theta=1},s)=O(k^{2}).

Proof.

To reduce notation, define P0=PW|Θ=0P_{0}=P_{W|\Theta=0} and P1=PW|Θ=1P_{1}=P_{W|\Theta=1}. It is shown in [10, p. 85] that μ′′\mu^{\prime\prime} can be written as the variance of a log-likelihood ratio:

μ′′(P0,P1,s)=Var[logP1(W~)P0(W~)],\mu^{\prime\prime}(P_{0},P_{1},s)={\rm Var}\bigg{[}\log\frac{P_{1}(\tilde{W})}{P_{0}(\tilde{W})}\bigg{]}, (118)

where W~\tilde{W} is distributed according to the following “tilted” distribution when s(0,1)s\in(0,1):

P~(w)=P0(w)1sP1(w)swP0(w)1sP1(w)s.\tilde{P}(\vec{w})=\frac{P_{0}(\vec{w})^{1-s}P_{1}(\vec{w})^{s}}{\sum_{\vec{w}^{\prime}}P_{0}(\vec{w}^{\prime})^{1-s}P_{1}(\vec{w}^{\prime})^{s}}. (119)

In addition, when ss equals an endpoint (i.e., 0 or 1), we can use the same expression for P~\tilde{P}, except that w\vec{w} must be restricted to satisfy both P0(w)>0P_{0}(\vec{w})>0 and P1(w)>0P_{1}(\vec{w})>0 (with P~(w)=0\tilde{P}(\vec{w})=0 otherwise) [10].

We upper bound the variance by the second moment, i.e., 𝔼[(logP1(W~)P0(W~))2]\mathbb{E}\big{[}\big{(}\log\frac{P_{1}(\tilde{W})}{P_{0}(\tilde{W})}\big{)}^{2}\big{]}, and proceed by further upper bounding the latter. By the definition of P~\tilde{P}, any w\vec{w} such that P0(w)=0P_{0}(\vec{w})=0 or P1(w)=0P_{1}(\vec{w})=0 does not contribute to the second moment. On the other hand, for θ{0,1}\theta\in\{0,1\}, if Pθ(w)0P_{\theta}(\vec{w})\neq 0, then we have

Pθ(w)\displaystyle P_{\theta}(\vec{w}) =yPk(y|θ)Qk(w|x^(y))\displaystyle=\sum_{\vec{y}}P^{k}(\vec{y}|\theta)Q^{k}(\vec{w}|\vec{\hat{x}}(\vec{y})) (120)
PminkQmink,\displaystyle\geq P_{\min}^{k}Q_{\min}^{k}, (121)

where PkP^{k} is the kk-fold product of PP (and similarly for QkQ^{k}), x^(y)\vec{\hat{x}}(\vec{y}) is the transmitted X^\hat{X} sequence when the teacher receives y\vec{y}, and Pmin,QminP_{\min},Q_{\min} are the smallest non-zero transition probabilities of PP and QQ. It follows that each non-zero Pθ(w)P_{\theta}(\vec{w}) is bounded between PminkQminkP_{\min}^{k}Q_{\min}^{k} and 11, which implies that logP1(w)P0(w)=O(k)\log\frac{P_{1}(\vec{w})}{P_{0}(\vec{w})}=O(k), and hence 𝔼[(logP1(W~)P0(W~))2]=O(k2)\mathbb{E}\big{[}\big{(}\log\frac{P_{1}(\tilde{W})}{P_{0}(\tilde{W})}\big{)}^{2}\big{]}=O(k^{2}).

Acknowledgment

This work was supported by the Singapore National Research Foundation (NRF) under grant number R-252-000-A74-281.

References

  • [1] V. Jog and P. L. Loh, “Teaching and learning in uncertainty,” IEEE Transactions on Information Theory, vol. 67, no. 1, pp. 598–615, 2021.
  • [2] W. Huleihel, Y. Polyanskiy, and O. Shayevitz, “Relaying one bit across a tandem of binary-symmetric channels,” IEEE International Symposium on Information Theory (ISIT), 2019.
  • [3] S. Rajagopalan and L. Schulman, “A coding theorem for distributed computation,” ACM Symposium on Theory of Computing, 1994.
  • [4] X. Vives, “How fast do rational agents learn?” The Review of Economic Studies, vol. 60, no. 2, pp. 329–347, 1993.
  • [5] A. Jadbabaie, P. Molavi, and A. Tahbaz-Salehi, “Information heterogeneity and the speed of learning in social networks,” Columbia Business School Research Paper, no. 13-28, 2013.
  • [6] P. Molavi, A. Tahbaz-Salehi, and A. Jadbabaie, “Foundations of non-bayesian social learning,” Columbia Business School Research Paper, no. 15-95, 2017.
  • [7] M. Harel, E. Mossel, P. Strack, and O. Tamuz, “Rational Groupthink,” The Quarterly Journal of Economics, vol. 136, no. 1, pp. 621–668, 07 2020.
  • [8] A. El Gamal and Y.-H. Kim, Network information theory.   Cambridge university press, 2011.
  • [9] S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence.   OUP Oxford, 2013.
  • [10] C. Shannon, R. Gallager, and E. Berlekamp, “Lower bounds to error probability for coding on discrete memoryless channels. I,” Information and Control, vol. 10, no. 1, pp. 65–103, 1967.
  • [11] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems, 2nd ed.   Cambridge University Press, 2011.
  • [12] T. van Erven and P. Harremos, “Rényi divergence and Kullback-Leibler divergence,” IEEE Transactions on Information Theory, vol. 60, no. 7, pp. 3797–3820, 2014.
Yan Hao Ling received the B.Comp. degree in computer science and the B.Sci. degree in mathematics from the National University of Singapore (NUS) in 2021. He is now a PhD student in the Department of Computer Science at NUS. His research interests are in the areas of information theory, statistical learning, and theoretical computer science.
Jonathan Scarlett (S’14 – M’15) received the B.Eng. degree in electrical engineering and the B.Sci. degree in computer science from the University of Melbourne, Australia. From October 2011 to August 2014, he was a Ph.D. student in the Signal Processing and Communications Group at the University of Cambridge, United Kingdom. From September 2014 to September 2017, he was post-doctoral researcher with the Laboratory for Information and Inference Systems at the École Polytechnique Fédérale de Lausanne, Switzerland. Since January 2018, he has been an assistant professor in the Department of Computer Science and Department of Mathematics, National University of Singapore. His research interests are in the areas of information theory, machine learning, signal processing, and high-dimensional statistics. He received the Singapore National Research Foundation (NRF) fellowship, and the NUS Early Career Research Award.