
Optimal Rates of Teaching and
Learning Under Uncertainty

Yan Hao Ling and Jonathan Scarlett
Abstract

In this paper, we consider a recently-proposed model of teaching and learning under uncertainty, in which a teacher receives independent observations of a single bit corrupted by binary symmetric noise, and sequentially transmits to a student through another binary symmetric channel based on the bits observed so far. After a given number n of transmissions, the student outputs an estimate of the unknown bit, and we are interested in the exponential decay rate of the error probability as n increases. We propose a novel block-structured teaching strategy in which the teacher encodes the number of 1s received in each block, and show that the resulting error exponent is the binary relative entropy D\big(\frac{1}{2}\|\max(p,q)\big), where p and q are the noise parameters. This matches a trivial converse result based on the data processing inequality, and settles two conjectures of [Jog and Loh, 2021] and [Huleihel, Polyanskiy, and Shayevitz, 2019]. In addition, we show that the computation time required by the teacher and student is linear in n. We also study a more general setting in which the binary symmetric channels are replaced by general binary-input discrete memoryless channels. We provide an achievability bound and a converse bound, and show that the two coincide in certain cases, including (i) when the two channels are identical, and (ii) when the teacher-student channel is a binary symmetric channel. More generally, we give sufficient conditions under which our learning rate is the best possible for block-structured protocols.

The authors are with the Department of Computer Science and Department of Mathematics, School of Computing, National University of Singapore (NUS). Jonathan Scarlett is also with the Institute of Data Science, NUS. Emails: [email protected]; [email protected]. This work was presented in part at the 2021 IEEE International Symposium on Information Theory (ISIT).

I Introduction

In several societal and technological domains, one is interested in how agents interact with their environment and with each other to attain goals such as learning information about the environment, conveying this information to other agents, and reaching a common consensus. While a comprehensive theoretical understanding of such problems would likely require highly sophisticated mathematical models, even the simplest models come with unique insights and challenges.

In this paper, we study a recently-proposed model of teaching and learning under uncertainty [1, 2], in which a teacher observes noisy information regarding an unknown 1-bit quantity Θ\Theta (the environment), and seeks to convey information to a student via a noisy channel to facilitate learning Θ\Theta. We establish the optimal learning rate (i.e., exponential decay of the error probability) of this problem, thereby settling two notable conjectures made in [1, 2] described in detail below. In addition, we study a generalization of the problem to more general binary-input discrete memoryless channels, and obtain the optimal learning rate in certain special cases of interest.

I-A Model and Definitions

Figure 1: Illustration of the problem setup.

We first formalize the model, which is depicted in Figure 1. The unknown quantity of interest is a binary random variable \Theta that takes each value in \{0,1\} with probability \frac{1}{2}.

At each time step i\in\{1,\dotsc,n\}, the teacher observes \Theta through BSC(p), a binary symmetric channel with error rate p\in(0,1/2):

\mathbb{P}(Y_{i}=\Theta)=1-p,\qquad\mathbb{P}(Y_{i}=1-\Theta)=p. (1)

The binary symmetric channel is memoryless, meaning that the Y_{i}'s are conditionally independent given \Theta. In Section III, we will study more general and possibly non-symmetric binary-input discrete memoryless channels.

At time i, the teacher then transmits binary information \hat{X}_{i} to the student through another binary symmetric channel BSC(q), where q\in(0,1/2). Thus, denoting the student's observations by Z_{i}, we have

\mathbb{P}(\hat{X}_{i}=Z_{i})=1-q,\qquad\mathbb{P}(\hat{X}_{i}=1-Z_{i})=q, (2)

and the Z_{i} are conditionally independent given \hat{X}_{i}. Importantly, \hat{X}_{i} must only be a function of Y_{1},\ldots,Y_{i}; the teacher cannot look into the future.

At time n, having received Z_{1},\ldots,Z_{n}, the student makes an estimate of \Theta, which we denote by \hat{\Theta}_{n} (or sometimes simply \hat{\Theta}). Then, the learning rate is defined as

\mathcal{R}=\limsup_{n\rightarrow\infty}\left\{-\frac{1}{n}\ln\mathbb{P}(\hat{\Theta}_{n}\neq\Theta)\right\}. (3)

We assume that the noise parameters p and q are known to both the teacher and student; removing this assumption may be an interesting direction for future work.
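To make the setup concrete, the following is a minimal simulation sketch in Python (with hypothetical parameter values) of a single run of the model, using the simple-forwarding teacher and majority-rule student recalled in Section I-B below as placeholder strategies; it only illustrates the interfaces of the model, not the protocol proposed in this paper.

```python
import random

def bsc(bit, crossover):
    """Pass a bit through a binary symmetric channel with the given crossover probability."""
    return bit ^ (random.random() < crossover)

def run_once(n, p, q):
    """One run of the teaching/learning model with a simple-forwarding teacher
    and a majority-rule student (baseline strategies from Section I-B)."""
    theta = random.randint(0, 1)            # unknown bit, uniform on {0, 1}
    z = []
    for _ in range(n):
        y = bsc(theta, p)                   # teacher observes Theta through BSC(p)
        x_hat = y                           # simple forwarding: X_hat_i = Y_i
        z.append(bsc(x_hat, q))             # student observes X_hat_i through BSC(q)
    theta_hat = int(sum(z) > n / 2)         # majority rule
    return theta_hat == theta

# Hypothetical parameters, for illustration only.
n, p, q, trials = 200, 0.1, 0.1, 10_000
errors = sum(not run_once(n, p, q) for _ in range(trials))
print("empirical error probability:", errors / trials)
```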

I-B Existing Results

The preceding motivation and setup follows that of Jog and Loh [1], and the same problem was also considered with different motivation and terminology by Huleihel, Polyanskiy, and Shayevitz [2], who framed the problem using the terminology of relay channels. While relay channels are already a central topic in information theory, the authors of [2] further motivated their study via the more recent information velocity problem, which was posed by Yury Polyanskiy [2] and is also captured under a general framework studied by Rajagopalan and Schulman [3]. In that problem, the goal is to reliably transmit a single bit over a long chain of relays while maintaining a non-vanishing ratio between the number of hops and the total transmission time. Below we will discuss a conjecture (for the 2-hop setting) made in [2] that is directly inspired by the connection with information velocity.

It is known that the student’s optimal strategy is the maximum likelihood decoder (assuming the teacher’s strategy is known) [2, 1]. In [1], it was discussed that optimal strategies can be difficult to analyze in general, and accordingly, the following two simpler student strategies were considered:

  • Majority rule: \hat{\Theta}_{n} is chosen according to the majority of the observations Z_{1},\ldots,Z_{n}.

  • \epsilon-majority rule: \hat{\Theta}_{n} is the majority of the last \epsilon\cdot n observations for some \epsilon\in(0,1), with the rough idea being to allow time for the teacher to learn \Theta first.

These student strategies were analyzed together with the following teacher strategies:

  • Simple forwarding: The teacher directly forwards its observation at each time step, i.e., \hat{X}_{i}=Y_{i}.

  • Cumulative teaching: The teacher transmits its best estimate of \Theta at each time step, i.e., \hat{X}_{i} is the majority value among Y_{1},\ldots,Y_{i}.

It was demonstrated in [1] that none of these strategies uniformly outperforms the others, and that any combination of them falls short of the D(1/2\|\max(p,q)) upper bound (i.e., converse) based on data processing arguments (here and subsequently, D(p_{1}\|p_{2}) denotes the relative entropy between Bernoulli distributions with parameters p_{1} and p_{2}). The optimal learning rate was left as an open problem, and it was conjectured that at least part of the weakness is in the upper bound, i.e., that one cannot attain a rate of D(1/2\|\max(p,q)).

In [2], both simple forwarding and cumulative teaching (i.e., relaying) were considered, along with a block-structured variant that only updates the majority every \sqrt{n} time steps. These strategies were analyzed alongside the optimal maximum-likelihood learning (i.e., decoding) rule. While the learning rates attained again fall short of the D(1/2\|\max(p,q)) upper bound, it was shown that one comes within a factor \frac{3}{4} when p=q and p\to\frac{1}{2} (i.e., the high noise setting). It was conjectured in [2] that using a more sophisticated protocol, this constant \frac{3}{4} could be made arbitrarily close to one in this high-noise limit. As hinted above, the motivation for this conjecture was the fact that positive information velocity is known to be attainable; the authors of [2] discuss that for this to be possible, the 1-hop and 2-hop exponents should match in the high-noise limit so that "information propagation does not slow down".

The study of [1] was inspired by earlier works on social learning [4, 5, 6], and most similar to [7]. Since these are less directly relevant to our work, we refer the reader to [1] for a detailed discussion of the differences.

As we will see shortly, our main result disproves the above-mentioned conjecture of [1], and not only confirms the conjecture of [2] for the regime p\to\frac{1}{2}, but also significantly strengthens it by extending it to all p\in(0,\frac{1}{2}).

II Optimal Rate Under Binary Symmetric Noise

In this section, we focus on the above-described setup in which both channels in the system are binary symmetric channels. We turn to more general binary-input discrete memoryless channels in Section III.

II-A Statement of Optimal Learning Rate

Our first main result is stated as follows.

Theorem 1.

Under the preceding setup, for any p,q\in(0,1/2), let \mathcal{R}^{*}(p,q) be the supremum of learning rates across all teaching and learning protocols. Then

\mathcal{R}^{*}(p,q)\geq D\Big(\frac{1}{2}\,\Big\|\max(p,q)\Big), (4)

where D(a\|b)=a\ln\frac{a}{b}+(1-a)\ln\frac{1-a}{1-b} denotes the binary relative entropy function (in base e).

We achieve this result via a novel block-structured teaching strategy. In contrast with [2], the teacher transmits bits based on a single block at a time, ignoring earlier blocks. The teacher encodes the number of 1s received in the block by sending a sorted sequence (i.e., a string of 1s followed by a string of 0s), with the transition point chosen according to a carefully-chosen non-linear function of the number of 1s observed. The student performs maximum-likelihood decoding, and we study the error probability via the Bhattacharyya coefficient. The details are given in Sections II-B and II-C, and in Section II-D we show that the overall computation time required by the teacher and student is linear in n.

Comparison to upper bound (converse). If q=0, then the optimal learning rate is given by the majority decoder, which achieves a learning rate of [1, 2]

D\Big(\frac{1}{2}\,\Big\|\,p\Big)=\frac{1}{2}\ln\left(\frac{1}{4p(1-p)}\right). (5)

Through data processing inequalities, we have \mathcal{R}^{*}(p,q)\leq\mathcal{R}^{*}(p,0)=D\big(\frac{1}{2}\|p\big), and an analogous argument yields \mathcal{R}^{*}(p,q)\leq D\big(\frac{1}{2}\|q\big). We therefore obtain the upper bound

\mathcal{R}^{*}(p,q)\leq D\Big(\frac{1}{2}\,\Big\|\max(p,q)\Big). (6)

Theorem 1 shows that this bound is tight, disproving the above-mentioned conjecture of [1], and both confirming and strengthening the conjecture of [2].

In the remainder of the section, we only consider the case p=q. This is without loss of generality, since in the general case, one can inject additional noise into the system to make both noise parameters equal to \max(p,q).

II-B Teacher and Student Protocol

Before describing the protocol, we first need to introduce some statistical tools.

II-B1 Statistical Tools and Student Decoding Rule

Let A and B be discrete random variables defined on some finite alphabet \Omega. The Bhattacharyya coefficient is a real number in [0,1] given by

\rho(A,B)=\sum_{x\in\Omega}\sqrt{\mathbb{P}(A=x)\mathbb{P}(B=x)}. (7)

Overloading notation, we will sometimes also denote this quantity by \rho(P_{A},P_{B}), where P_{A} and P_{B} are the associated probability mass functions.

The Bhattacharyya coefficient measures the ‘closeness’ between the distributions of A and B. Values close to 0 indicate easily separated distributions, while values close to 1 indicate very similar distributions. In particular, \rho(A,B)=0 if and only if A and B have disjoint supports, and \rho(A,B)=1 if and only if they are identically distributed.
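As a small illustration (a minimal sketch with hypothetical Bernoulli distributions), the quantity in (7) can be computed directly from two probability mass functions over a common finite alphabet:

```python
from math import sqrt

def bhattacharyya(p_a, p_b):
    """Bhattacharyya coefficient of two PMFs given as dicts over a common finite alphabet."""
    support = set(p_a) | set(p_b)
    return sum(sqrt(p_a.get(x, 0.0) * p_b.get(x, 0.0)) for x in support)

# Bernoulli(p) vs. Bernoulli(1-p), a pair that appears in the error analysis below.
p = 0.2
A = {0: 1 - p, 1: p}
B = {0: p, 1: 1 - p}
print(bhattacharyya(A, B))   # 2*sqrt(0.2*0.8) = 0.8 = exp(-D(1/2 || 0.2))
```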

We will make use of some standard properties of the Bhattacharyya coefficient:

  • Suppose that we have a random variable X which is known to follow one of two known distributions, A or B. If we draw one instance of X and use it to decide which distribution X came from, then there exists a strategy achieving an error probability of at most \rho(A,B). To achieve this, we simply use the maximum likelihood test: Given X=x, we decide that A is the true distribution if \mathbb{P}(A=x)>\mathbb{P}(B=x), and decide that B is the true distribution otherwise. The probability of selecting distribution A when B is indeed the true distribution is given by

    \sum_{x\in\Omega}\mathbb{P}(B=x)\mathbbm{1}_{\mathbb{P}(A=x)>\mathbb{P}(B=x)}\leq\sum_{x\in\Omega}\sqrt{\mathbb{P}(A=x)\mathbb{P}(B=x)}. (8)

    The other error type is upper bounded similarly, and combining the two error types yields the desired bound:

    \mathbb{P}({\rm error})\leq\rho(A,B). (9)
  • If A_{1},A_{2} are independent, and similarly for B_{1},B_{2}, then

    \rho((A_{1},A_{2}),(B_{1},B_{2}))=\rho(A_{1},B_{1})\cdot\rho(A_{2},B_{2}). (10)

    This follows by a direct expansion of (7).

  • Let \vec{A}=(A_{1},A_{2},\ldots,A_{m}) be m i.i.d. copies of A and \vec{B}=(B_{1},B_{2},\ldots,B_{m}) be m i.i.d. copies of B. Then

    \rho(\vec{A},\vec{B})=\rho(A,B)^{m}. (11)

    This is a direct consequence of (10).

In accordance with (9), we assume throughout the paper that the student adopts the optimal maximum-likelihood decoding rule, where the two underlying distributions are the conditional distributions of Z_{1},\dotsc,Z_{n} given \Theta=1 and \Theta=0, respectively.

II-B2 Block-Structured Design and Teaching Strategy

The transmission protocol of length n is broken down into n/k blocks, each of length k. For each i=1,2,3,\ldots,\frac{n}{k}-1, we let (\hat{X}_{ik+1},\ldots,\hat{X}_{(i+1)k}) be a (deterministic) function of (Y_{(i-1)k+1},\dotsc,Y_{ik}). The values of \hat{X}_{1},\ldots,\hat{X}_{k} are ignored by the student. The function used to generate \vec{\hat{X}}\in\{0,1\}^{k} from \vec{Y}\in\{0,1\}^{k} is given in Section II-B3.

Under this design, the i-th block of \hat{X} only depends on the (i-1)-th block of Y; see Figure 2 for an illustration. To simplify notation, we define

W_{i}=(Z_{ik+1},\ldots,Z_{(i+1)k}) (12)

to be the i-th block received by the student.

Figure 2: A diagrammatic representation of the block protocol.

The memoryless properties of the channels imply that the W_{i}'s are conditionally independent given \Theta. Since the function that generates \vec{\hat{X}} from \vec{Y} is the same every time, their distributions are also identical; we denote this common distribution by P_{W}.

The student needs to distinguish between n/k-1 i.i.d. copies of P_{W|\Theta=1} and n/k-1 i.i.d. copies of P_{W|\Theta=0}. Using (9) and (11), we have

\mathbb{P}(\hat{\Theta}_{n}\neq\Theta)\leq\rho(P_{W|\Theta=1},P_{W|\Theta=0})^{n/k-1}. (13)

By setting k to be a fixed constant as n increases, it follows that the learning rate is lower bounded by

\mathcal{R}=\limsup_{n\rightarrow\infty}\left\{-\frac{1}{n}\ln\mathbb{P}(\hat{\Theta}_{n}\neq\Theta)\right\} (14)
\geq-\frac{1}{k}\ln\rho(P_{W|\Theta=1},P_{W|\Theta=0}), (15)

where W implicitly depends on k. In fact, while a constant k suffices for our purposes, a similar argument holds when k increases but satisfies k=o(n), in which case the operation \limsup_{k\rightarrow\infty} should also be included on the right-hand side of (15).

Remark 1.

This block structure bears high-level similarity to block-Markov coding in the analysis of relay channels [8, Sec. 16.4]. However, the details here are very different from existing analyses that use joint typicality and related notions. The error exponent associated with such analyses would be low due to the effective block length of k, whereas we maintain a high error exponent via joint decoding over the entire length-n received sequence.

II-B3 Protocol within Each Block

In each block, the teacher receives k noisy bits (e.g., Y_{1},\dotsc,Y_{k} in the first block). Let \alpha be the fraction of 1s among these bits (i.e., the number of 1s divided by k).

The teacher sends k\cdot f(\alpha) bits of 1 followed by k(1-f(\alpha)) bits of 0 (with a delay of one block), where f:[0,1]\rightarrow[0,1] is defined as

f(\alpha)=\begin{cases}0&0\leq\alpha<p\\ \frac{D(\alpha\|p)}{2D(1/2\|p)}&p\leq\alpha\leq 1/2\\ 1-f(1-\alpha)&1/2<\alpha<1-p\\ 1&1-p\leq\alpha\leq 1.\end{cases} (16)

A sample plot is given in Figure 3 with p=0.2. As exemplified in the figure, f is non-decreasing, and

f(\alpha)=1-f(1-\alpha), (17)
f(\alpha)\leq\frac{D(\alpha\|p)}{2D(1/2\|p)}, (18)

where (18) is trivial for \alpha\leq\frac{1}{2}, and follows easily from the convexity of D(\cdot\|p) for \alpha>\frac{1}{2}. Specifically, this convexity implies that the derivative of D(\cdot\|p) increases for \alpha>\frac{1}{2}, whereas the chain rule applied to (17) implies that the derivative of f(\alpha) decreases for \alpha>\frac{1}{2}. Since (18) holds with equality for \alpha\in\big[p,\frac{1}{2}\big] by definition, this behavior of the derivatives implies that (18) also holds for \alpha>\frac{1}{2}.

Figure 3: The function f used in the protocol, plotted for p=0.2. The dashed line represents f(\alpha), while the solid line represents \frac{D(\alpha\|p)}{2D(1/2\|p)}.

In general, k\cdot f(\alpha) may not be an integer. When this happens, we can use \lfloor k\cdot f(\alpha)\rfloor bits of 1 followed by k-\lfloor k\cdot f(\alpha)\rfloor bits of 0. The rounding does not affect the error analysis, so to simplify notation, we focus on the integer-valued case. See the remark following (20) for an additional discussion of this issue.

Before proceeding with the student strategy and error analysis, we intuitively explain why sending a sorted sequence (i.e., with all the 1s sent before the 0s) can make decoding easier for the student. Suppose the student receives the string 111100000010. If the student knows in advance that the string sent by the teacher is sorted, they can be confident that the out-of-place 1 near the end of the string was flipped. However, if this were an unsorted string, such corrections could not be made.

The reasoning behind our choice of f will become apparent in the error analysis in the following subsection, but intuitively, it serves as a carefully-chosen middle ground between the simpler choices f(\alpha)=\alpha (bearing some resemblance to simple forwarding) and f(\alpha)=\boldsymbol{1}\{\alpha>\frac{1}{2}\} (bearing some resemblance to cumulative teaching).
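A minimal sketch of the teacher's per-block encoding rule based on (16) is given below; the helper names kl_bin, f, and encode_block are hypothetical, and rounding is handled by flooring as discussed above.

```python
from math import floor, log

def kl_bin(a, b):
    """Binary relative entropy D(a || b) in nats, with the convention 0*log(0) = 0."""
    terms = 0.0
    if a > 0:
        terms += a * log(a / b)
    if a < 1:
        terms += (1 - a) * log((1 - a) / (1 - b))
    return terms

def f(alpha, p):
    """The transition-point function f from (16)."""
    if alpha < p:
        return 0.0
    if alpha <= 0.5:
        return kl_bin(alpha, p) / (2 * kl_bin(0.5, p))
    if alpha < 1 - p:
        return 1.0 - f(1.0 - alpha, p)
    return 1.0

def encode_block(received_bits, p):
    """Teacher's transmission for one block: floor(k*f(alpha)) ones followed by zeros."""
    k = len(received_bits)
    alpha = sum(received_bits) / k
    num_ones = floor(k * f(alpha, p))
    return [1] * num_ones + [0] * (k - num_ones)

# Hypothetical example: k = 10, p = 0.2, seven 1s observed by the teacher.
print(encode_block([1, 1, 1, 0, 1, 1, 0, 1, 1, 0], 0.2))
```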

II-C Error Analysis

Continuing the study of a single block W defined according to (12), observe that we have the Markov chain

\Theta\rightarrow\alpha\rightarrow W, (19)

and in accordance with (15), we would like the distributions P_{W|\Theta=1} and P_{W|\Theta=0} to be as ‘far apart’ as possible. We break the error analysis into two parts, analyzing the relations \Theta\rightarrow\alpha and \alpha\rightarrow W separately.

II-C1 Relating Θ\Theta and α\alpha

Define W_{\alpha} to follow the conditional distribution of W given \alpha. The distribution of W_{\alpha} is simple; recalling that we are focusing on the case p=q, we have the following:

  • The bits of W_{\alpha} are independently distributed;

  • If i\leq k\cdot f(\alpha), then the i-th bit is 1 with probability 1-p and 0 otherwise;

  • If i>k\cdot f(\alpha), then the i-th bit is 1 with probability p and 0 otherwise.

We proceed by upper bounding \rho(W_{\alpha_{1}},W_{\alpha_{2}}), starting with the case that \alpha_{1}\leq\alpha_{2}.

The bits up to index k\cdot f(\alpha_{1}), and the bits after index k\cdot f(\alpha_{2}), are all identically distributed between W_{\alpha_{1}} and W_{\alpha_{2}}, and thus do not provide any distinguishing power. We are left with k\cdot f(\alpha_{2})-k\cdot f(\alpha_{1}) bits, which are i.i.d. according to either Bernoulli(p) or Bernoulli(1-p).

The Bhattacharyya coefficient between Bernoulli(p) and Bernoulli(1-p) can easily be computed to be 2\sqrt{p(1-p)}=e^{-D(1/2\|p)}, and using (11), we obtain

\rho(W_{\alpha_{1}},W_{\alpha_{2}})=e^{-kD(1/2\|p)\cdot(f(\alpha_{2})-f(\alpha_{1}))}. (20)

(If we were to explicitly incorporate the rounding discussed following (18), then (20) would be replaced by the slightly looser bound \rho(W_{\alpha_{1}},W_{\alpha_{2}})\leq e^{-D(1/2\|p)\cdot(kf(\alpha_{2})-kf(\alpha_{1})-1)}; the subtraction of one in the exponent only amounts to multiplying the entire expression by a constant, which does not impact the resulting exponential decay rate.)

To simplify the exponent, we use (17) and (18) to obtain

f(\alpha_{1})\leq\frac{D(\alpha_{1}\|p)}{2D(1/2\|p)}, (21)
f(\alpha_{2})=1-f(1-\alpha_{2})\geq 1-\frac{D(1-\alpha_{2}\|p)}{2D(1/2\|p)}, (22)

and hence,

f(\alpha_{2})-f(\alpha_{1})\geq 1-\frac{D(\alpha_{1}\|p)}{2D(1/2\|p)}-\frac{D(1-\alpha_{2}\|p)}{2D(1/2\|p)}. (23)

We can therefore weaken (20) to

\rho(W_{\alpha_{1}},W_{\alpha_{2}})\leq e^{-k\cdot(D(1/2\|p)-D(\alpha_{1}\|p)/2-D(1-\alpha_{2}\|p)/2)}. (24)

Although the assumption \alpha_{1}\leq\alpha_{2} was used in the derivation of (20), a trivial argument shows that when \alpha_{1}>\alpha_{2}, the right-hand side of (20) is at least one (since f is non-decreasing). Since the Bhattacharyya coefficient never exceeds one, it follows that (24) remains true in the case that \alpha_{1}>\alpha_{2}.

II-C2 Relating α\alpha and WW

The next step is to decompose the distribution of WW into the simpler distributions WαW_{\alpha}. To do so, we use the following definition.

Definition 1.

Let A_{1},\ldots,A_{m} be probability distributions defined on a common finite probability space \Omega, and let p_{1},\ldots,p_{m} be real numbers in [0,1] that sum to one. The mixture distribution A described by [p_{1},A_{1};\ldots;p_{m},A_{m}] is defined by \mathbb{P}(A=x)=\sum_{i=1}^{m}p_{i}\cdot\mathbb{P}(A_{i}=x).

We proceed with a simple technical lemma on the Bhattacharyya coefficient of mixtures.

Lemma 1.

Suppose that A follows a mixture distribution described by [p_{1},A_{1};\ldots;p_{m},A_{m}]. Then for any distribution B, we have

\rho(A,B)\leq\sum_{i=1}^{m}\sqrt{p_{i}}\,\rho(A_{i},B). (25)
Proof.

Using the inequality \sqrt{a+b}\leq\sqrt{a}+\sqrt{b}, we have

\rho(A,B)=\sum_{x\in\Omega}\sqrt{\mathbb{P}(A=x)\mathbb{P}(B=x)} (26)
=\sum_{x\in\Omega}\sqrt{\sum_{i=1}^{m}p_{i}\mathbb{P}(A_{i}=x)\mathbb{P}(B=x)} (27)
\leq\sum_{x\in\Omega}\sum_{i=1}^{m}\sqrt{p_{i}\mathbb{P}(A_{i}=x)\mathbb{P}(B=x)}. (28)

Simple re-arranging gives the desired result. ∎

Using the symmetry of the Bhattacharyya coefficient and applying Lemma 1 twice, we obtain

\rho(P_{W|\Theta=1},P_{W|\Theta=0})\leq\sum_{i=0}^{k}\sum_{j=0}^{k}\sqrt{\mathbb{P}(\alpha=i/k|\Theta=1)}\sqrt{\mathbb{P}(\alpha=j/k|\Theta=0)}\,\rho(W_{i/k},W_{j/k}). (29)

Moreover, the conditional distribution of \alpha given \Theta is simple:

(\alpha|\Theta=1)\sim\frac{1}{k}\cdot\text{Binomial}(k,1-p), (30)
(\alpha|\Theta=0)\sim\frac{1}{k}\cdot\text{Binomial}(k,p). (31)

Hence, applying the Chernoff bound for the binomial distribution (e.g., see [9, Sec. 2.2]), we obtain

\mathbb{P}(\alpha=\alpha_{0}|\Theta=1)\leq e^{-kD(\alpha_{0}\|1-p)}=e^{-kD(1-\alpha_{0}\|p)}, (32)

and similarly,

\mathbb{P}(\alpha=\alpha_{0}|\Theta=0)\leq e^{-kD(\alpha_{0}\|p)}. (33)

Combining these bounds with (29) and with (24) (applied with \alpha_{1}=j/k and \alpha_{2}=i/k, which is valid since \rho is symmetric and (24) holds for any ordering), we find that the D(j/k\|p)/2 and D(1-i/k\|p)/2 terms cancel in the exponent, and we are left with

\rho(P_{W|\Theta=1},P_{W|\Theta=0})\leq\sum_{i=0}^{k}\sum_{j=0}^{k}e^{-kD(1/2\|p)} (34)
=(k+1)^{2}e^{-kD(1/2\|p)}. (35)

Combining (15) with (35) gives

\mathcal{R}\geq D(1/2\|p)-\mathcal{O}\left(\frac{\log k}{k}\right), (36)

and hence, by setting k large enough, we can obtain any learning rate arbitrarily close to D(1/2\|p). This completes the proof of Theorem 1.
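As a numerical sanity check (not needed for the proof), one can compute the single-block Bhattacharyya coefficient \rho(P_{W|\Theta=1},P_{W|\Theta=0}) exactly for a small block size by enumerating all received blocks, and compare the resulting per-block exponent -\frac{1}{k}\ln\rho from (15) against the target D(1/2\|p); the following minimal sketch reuses the hypothetical helpers kl_bin and f from the earlier snippet.

```python
from itertools import product
from math import comb, floor, log, sqrt

def block_output_pmf(theta, k, p):
    """Exact PMF of the student's received block W given Theta, for the protocol
    of Section II-B with p = q (reuses the hypothetical helpers f and kl_bin above)."""
    pmf = {}
    for m in range(k + 1):                        # number of 1s the teacher observes
        bit_prob = (1 - p) if theta == 1 else p   # P(Y_i = 1 | Theta)
        p_alpha = comb(k, m) * bit_prob ** m * (1 - bit_prob) ** (k - m)
        ones = floor(k * f(m / k, p))             # teacher sends `ones` 1s, then 0s
        sent = [1] * ones + [0] * (k - ones)
        for w in product((0, 1), repeat=k):       # received block through BSC(p)
            p_w = 1.0
            for wi, xi in zip(w, sent):
                p_w *= (1 - p) if wi == xi else p
            pmf[w] = pmf.get(w, 0.0) + p_alpha * p_w
    return pmf

k, p = 8, 0.2
pmf1, pmf0 = block_output_pmf(1, k, p), block_output_pmf(0, k, p)
rho = sum(sqrt(pmf1[w] * pmf0[w]) for w in pmf1)
print(-log(rho) / k, kl_bin(0.5, p))   # per-block exponent vs. the target D(1/2 || p)
```

As k grows, the printed per-block exponent approaches D(1/2\|p), in line with (36).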

II-D Computational complexity

In this section, we argue that it is possible to execute the described protocol with an overall computation time of \mathcal{O}(n) (i.e., no higher than that of simply reading the received bits). This is immediate for the teacher, so we focus our attention on the student.

To implement the maximum likelihood decoder, it suffices to compute the associated log-likelihood ratio (LLR) \log\frac{\mathbb{P}(\vec{Z}=\vec{z}|\Theta=1)}{\mathbb{P}(\vec{Z}=\vec{z}|\Theta=0)} given the n received bits \vec{z}. In addition, since the n/k-1 blocks are independent by construction and memorylessness, the overall LLR is the sum of per-block LLRs. Hence, we only need to show that each of these can be computed in time \mathcal{O}(k).

The only minor difficulty in computing the LLR for a single block is that we need to account for all k+1 possible choices of the latent/hidden variable \alpha\in\{0,\frac{1}{k},\dotsc,1\}: Letting W denote a generic length-k block according to (12), we have

\mathbb{P}(W|\Theta=1)=\sum_{\alpha}\mathbb{P}(\alpha|\Theta=1)\mathbb{P}(W|\alpha), (37)

and similarly for \mathbb{P}(W|\Theta=0). In addition, since the channel is memoryless, we have

\mathbb{P}(W|\alpha)=\prod_{i=1}^{k}\mathbb{P}(W(i)|\alpha), (38)

where W(i) is the i-th bit comprising W. Naively, we could evaluate (37) and (38) directly, but this would give a per-block complexity of \mathcal{O}(k^{2}).

To improve this, we recall that for two values \alpha_{1},\alpha_{2}\in\{0,\frac{1}{k},\dotsc,1\}, the associated values of \mathbb{P}(W(i)|\alpha) only differ for values of i in between k\cdot f(\alpha_{1}) and k\cdot f(\alpha_{2}). Therefore, we can initially compute \mathbb{P}(W|\alpha) for \alpha=0, and then use it to compute the value for \alpha=\frac{1}{k} by only looking at the first k\cdot f\big(\frac{1}{k}\big) bits of W, and so on for \alpha\in\{\frac{2}{k},\frac{3}{k},\dotsc,1\}; at each step, we only need to look at the bits affected by incrementing \alpha. In this manner, it only takes \mathcal{O}(k) time to compute all k+1 values of \mathbb{P}(W|\alpha). In addition, \mathbb{P}(\alpha|\Theta=1) in (37) follows a scaled binomial distribution, whose probability mass function can be pre-computed in \mathcal{O}(k) time (assuming constant-time arithmetic operations). Hence, we obtain an overall per-block computation time of \mathcal{O}(k).
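A minimal sketch of this incremental computation for the p=q case is given below (reusing the hypothetical helper f from the earlier snippet); it computes all k+1 values of \mathbb{P}(W|\alpha) in a single pass and then mixes them over the binomial prior as in (37).

```python
from math import floor, log

def block_llr(w, p):
    """Per-block LLR ln[ P(W = w | Theta = 1) / P(W = w | Theta = 0) ], computed in
    O(k) arithmetic operations (a sketch of Section II-D, reusing the helper f above)."""
    k = len(w)

    # P(W = w | alpha = 0): the teacher sent all 0s, so each received 1 is a flip.
    cur = 1.0
    for wi in w:
        cur *= p if wi == 1 else (1 - p)
    prob_w_given_alpha = [cur]

    # Incrementing alpha from (m-1)/k to m/k only changes the bits whose index lies
    # between floor(k*f((m-1)/k)) and floor(k*f(m/k)); update only those factors.
    boundary = 0
    for m in range(1, k + 1):
        new_boundary = floor(k * f(m / k, p))
        for i in range(boundary, new_boundary):   # these positions now target a 1
            cur /= p if w[i] == 1 else (1 - p)
            cur *= (1 - p) if w[i] == 1 else p
        boundary = new_boundary
        prob_w_given_alpha.append(cur)

    # Binomial priors P(alpha = m/k | Theta), built up by a running product in O(k).
    pmf1, pmf0 = [p ** k], [(1 - p) ** k]
    for m in range(1, k + 1):
        ratio = (k - m + 1) / m
        pmf1.append(pmf1[-1] * ratio * (1 - p) / p)
        pmf0.append(pmf0[-1] * ratio * p / (1 - p))

    # Mix over alpha as in (37) and return the log-likelihood ratio.
    like1 = sum(a * b for a, b in zip(pmf1, prob_w_given_alpha))
    like0 = sum(a * b for a, b in zip(pmf0, prob_w_given_alpha))
    return log(like1 / like0)

# The student sums block_llr over the n/k - 1 received blocks and outputs
# Theta_hat = 1 if the total is positive (and 0 otherwise).
```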

In addition to the low computational overhead, another advantage of our strategy is that it is anytime. Specifically, it can be run without knowledge of n, and we can stop the algorithm at any time and ask the student for the estimate \hat{\Theta}_{n}. Letting \epsilon_{k,n} be the resulting error probability with block size k and total time n, it still holds in this scenario that -\frac{1}{n}\ln\epsilon_{k,n}\rightarrow D(1/2\|p), as long as k\rightarrow\infty and n/k\rightarrow\infty.

III General Binary-Input Discrete Memoryless Channels

In this section, we consider the same model as Section I-A, except that we replace BSC(p) and BSC(q) by general discrete memoryless channels (DMCs) P and Q (known to the student and teacher), both of which have binary inputs \{0,1\}. We denote the transition probabilities by P(y|\theta) and Q(z|\hat{x}). The case that Q has non-binary input is discussed in Section IV.

We first overview the 1-hop case in Section III-A, following the classical work of Shannon, Gallager, and Berlekamp [10]. We then state an achievable learning rate in Section III-B, and give the protocol and analysis in Section III-C, which are essentially more technical variants of the binary symmetric case. We provide an upper bound on the learning rate (i.e., a converse) in Section III-D, and use it to deduce the optimal learning rate in certain special cases. While the tightness of our achievable learning rate remains open in general, we additionally show in Section III-E that, at least under certain technical assumptions, it cannot be improved for block-structured protocols.

III-A Existing Results on the 1-Hop Case

We first consider a 1-hop scenario in which a single agent makes repeated observations of the unknown variable Θ\Theta through repeated uses of a binary-input DMC PP. This is a well-studied problem, and we will summarize some of the results from [10].

We begin with a generalization of the Bhattacharyya coefficient. For a real number s\in(0,1) and two random variables A,B defined on the same finite alphabet \Omega, let

\rho(A,B,s)=\sum_{x\in\Omega}\mathbb{P}(A=x)^{1-s}\mathbb{P}(B=x)^{s}\in[0,1], (39)

where the upper bound of one follows by applying Jensen's inequality to \mathbb{E}_{A}\big[\big(\frac{P_{B}(B)}{P_{A}(A)}\big)^{s}\big]. This function can be extended to s=0 and s=1 by continuity; for instance, \rho(A,B,0)=\sum_{x\in\Omega}\mathbb{P}(A=x)\mathbbm{1}_{\mathbb{P}(B=x)>0}. We also note that setting s=\frac{1}{2} recovers the Bhattacharyya coefficient. We write \rho(A,B,s) and \rho(P_{A},P_{B},s) interchangeably. Similarly to (10)–(11), we have the following properties for independent pairs and length-m i.i.d. vectors:

\rho((A_{1},A_{2}),(B_{1},B_{2}),s)=\rho(A_{1},B_{1},s)\cdot\rho(A_{2},B_{2},s), (40)
\rho(\vec{A},\vec{B},s)=\rho(A,B,s)^{m}. (41)

Given the DMC P with inputs 0 and 1, we define

\rho_{P}(s)=\rho(P(\cdot|0),P(\cdot|1),s), (42)

and denote its logarithm by

\mu_{P}(s)=\ln\rho_{P}(s)\leq 0, (43)

where P(\cdot|0) and P(\cdot|1) are treated as probability distributions on the output alphabet. (See Figure 4 below for some examples of how \mu_{P}(s) varies as a function of s.) The first and second derivatives of \mu_{P}(s) for s\in(0,1) are denoted by \mu^{\prime}_{P}(s) and \mu^{\prime\prime}_{P}(s), and once again, the values at the endpoints s\in\{0,1\} are defined with respect to the appropriate limits.

Suppose that the agent makes a single observation of \Theta\in\{0,1\} through P, and uses it to estimate \Theta. Using the identity \mathbbm{1}_{\mathbb{P}(B=x)>\mathbb{P}(A=x)}\leq\big(\frac{\mathbb{P}(B=x)}{\mathbb{P}(A=x)}\big)^{s}, we find that for any s\geq 0, \rho_{P}(s) is an upper bound to the probability of a maximum-likelihood decision rule selecting B when A is true. An analogous property holds for the reverse error event with 1-s in place of s, and it follows that \rho_{P}(s) is an upper bound to the overall error probability for all s\in[0,1].

In view of the property in (41), if the agent instead makes n independent observations, the upper bound becomes \rho_{P^{n}}(s)=\rho_{P}(s)^{n}, implying a learning rate of at least

-\frac{1}{n}\ln\rho_{P}(s)^{n}=-\mu_{P}(s). (44)

The following well-known result reveals that this learning rate is optimal upon optimizing over s\in[0,1].

Lemma 2.

[10, Corollary of Thm. 5] Let s^{*}\in[0,1] be chosen to minimize \mu_{P}(s). Then any decoding strategy of the agent after n uses of the channel must have an error probability of at least \frac{1}{8}\exp\big(n\mu_{P}(s^{*})-\sqrt{2n\mu_{P}^{\prime\prime}(s^{*})}\big) when \Theta is equiprobable on \{0,1\}. (This is a weakened version of the statement in [10], in which we apply \max\{s,1-s\}\leq 1 in the coefficient of the \sqrt{n} term; in addition, we specialize to the case that every transmitted symbol is identical, in order to match our setup.)

Since the \sqrt{2n\mu_{P}^{\prime\prime}(s^{*})} term is sub-linear in n, this immediately gives the following corollary.

Corollary 1.

The optimal learning rate for the 1-hop system is given by -\min_{s\in[0,1]}\mu_{P}(s).
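To illustrate Corollary 1 numerically, the following minimal sketch (with a hypothetical dictionary representation of a binary-input DMC) evaluates \mu_{P}(s) on a grid and minimizes it; for a BSC(p), the minimizer is s=\frac{1}{2} and the resulting exponent recovers D(1/2\|p) from (5).

```python
from math import log

def mu(channel, s):
    """mu(s) = ln sum_y P(y|0)^(1-s) * P(y|1)^s for a binary-input DMC represented
    as {input: {output: probability}} (a hypothetical representation)."""
    outputs = set(channel[0]) | set(channel[1])
    return log(sum(channel[0].get(y, 0.0) ** (1 - s) * channel[1].get(y, 0.0) ** s
                   for y in outputs))

def one_hop_exponent(channel, grid=10_000):
    """-min_{s in [0,1]} mu(s) via a grid search; the endpoints s = 0, 1 are
    approached through the limit rather than evaluated directly."""
    return -min(mu(channel, i / grid) for i in range(1, grid))

p = 0.2
bsc = {0: {0: 1 - p, 1: p}, 1: {0: p, 1: 1 - p}}
print(one_hop_exponent(bsc))             # approximately D(1/2 || p)
print(0.5 * log(1 / (4 * p * (1 - p))))  # D(1/2 || p) via (5)
```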

III-B Achievability Result

We now state the second main result of our paper, giving an achievability result for general binary-input DMCs.

Theorem 2.

Let P and Q be arbitrary binary-input discrete memoryless channels, and let \mathcal{R}^{*}(P,Q) be the supremum of learning rates across all teaching and learning protocols. Then

\mathcal{R}^{*}(P,Q)\geq-\min_{s\in[0,1]}\max(\mu_{P}(s),\mu_{Q}(s)). (45)

Moreover, this can be improved to

\mathcal{R}^{*}(P,Q)\geq-\min_{s\in[0,1]}\max\big(\mu_{P}(s),\min(\mu_{Q}(s),\mu_{Q}(1-s))\big). (46)

While (46) clearly implies (45), we find the expression (45) easier to work with and prove. Once (45) is proved, (46) follows from the fact that one can choose to flip the inputs of Q, which amounts to replacing \mu_{Q}(s) by \mu_{Q}(1-s). (At first this leads to the seemingly different learning rate of -\min\big(\min_{s\in[0,1]}\max(\mu_{P}(s),\mu_{Q}(s)),\min_{s\in[0,1]}\max(\mu_{P}(s),\mu_{Q}(1-s))\big), but then (46) can be recovered by taking the two \min_{s\in[0,1]} operations outside the outer minimum (swapping the minimization order has no effect), and applying the identity \max(a,\min(b,c))=\min(\max(a,b),\max(a,c)), which is verified by checking all 6 orderings of a, b, and c.)

When the channels P and Q are identical, (45) reduces to the expression in Corollary 1. Since the learning rate cannot exceed that of the 1-hop case (due to a data-processing argument similar to the binary symmetric setting), we conclude that Theorem 2 provides the optimal learning rate in this case. Other scenarios in which Theorem 2 gives the optimal learning rate will be discussed in Section III-D.

III-C Protocol and Analysis for the Achievability Result

Fix an arbitrary value of \bar{s}\in[0,1] and define

\mu_{\max}=\max(\mu_{P}(\bar{s}),\mu_{Q}(\bar{s})). (47)

To prove (45), it suffices to show that -\mu_{\max} is an achievable learning rate. We assume without loss of generality that \mu_{\max}<0, since a learning rate of zero is trivial.

We again adopt the block structure used in Section II, with k denoting the block size. Suppose that in a single block, the teacher receives \vec{Y}=(Y_{1},Y_{2},\ldots,Y_{k}). We define the log-likelihood ratio (LLR) as

l(\vec{Y})=\sum_{i=1}^{k}\ln\frac{P(Y_{i}|1)}{P(Y_{i}|0)}, (48)

and let L_{1} (respectively, L_{0}) be the distribution of l(\vec{Y}) under \Theta=1 (respectively, \Theta=0).

Upon receiving \vec{Y}, the teacher computes l=l(\vec{Y}), and sends g(l) bits of 1 followed by k-g(l) bits of 0 (mirroring the sorted transmissions of Section II), where g(l) is defined as follows:

g(l)=\begin{cases}\min\big(k,\frac{1-\bar{s}}{\mu_{\max}}\ln\mathbb{P}(L_{0}\geq l)\big)&l\leq 0\\ \max\big(0,k-\frac{\bar{s}}{\mu_{\max}}\ln\mathbb{P}(L_{1}\leq l)\big)&l>0.\end{cases} (49)

Since 1-\bar{s}\geq 0, \mu_{\max}<0, and \ln\mathbb{P}(L_{0}\geq l)\leq 0, we have

\frac{1-\bar{s}}{\mu_{\max}}\ln\mathbb{P}(L_{0}\geq l)\geq 0, (50)

and similarly,

\frac{\bar{s}}{\mu_{\max}}\ln\mathbb{P}(L_{1}\leq l)\geq 0, (51)

which implies that 0\leq g(l)\leq k for all l. It may be necessary to round g(l) to the nearest integer, but similarly to Section II, this does not affect the result, and it is omitted from our analysis.

The main difference in the analysis compared to Section II is establishing the following technical lemma.

Lemma 3.

Under the preceding setup and definitions, we have for all ll that

1s¯μmaxln(L0l)ks¯μmaxln(L1l).\frac{1-\bar{s}}{\mu_{\max}}\ln\mathbb{P}(L_{0}\geq l)\geq k-\frac{\bar{s}}{\mu_{\max}}\ln\mathbb{P}(L_{1}\leq l). (52)
Proof.

The proof is based on probabilistic bounds on log-likelihood ratios given in [10], along with algebraic manipulations that are elementary but rather technical; the details are given in Appendix -A. ∎

We claim that this result implies

\frac{1-\bar{s}}{\mu_{\max}}\ln\mathbb{P}(L_{0}\geq l)\geq g(l)\geq k-\frac{\bar{s}}{\mu_{\max}}\ln\mathbb{P}(L_{1}\leq l). (53)

To see this, we check the possible cases in the definition of g(l) in (49). If l\leq 0, then the left inequality in (53) is trivial, and the right inequality follows by considering two sub-cases: If the \min(k,\cdot) in (49) is attained by k, then we apply (51), and otherwise, we apply (52). The case l>0 is handled similarly.

From here, the analysis is similar to that of Section II. Observe that we have the Markov chain \Theta\rightarrow l(\vec{Y})\rightarrow W, where W denotes the k symbols received by the student in a single block. Since W takes on two different distributions depending on whether \Theta=1 or \Theta=0, we may also view W as the output of a ‘single-use’ discrete memoryless channel with input \Theta and an output of length k. With a slight abuse of notation, we define the quantities \mu_{W}(s) and \rho_{W}(s) according to this channel.

To upper bound \rho_{W}(s), we use the following straightforward generalization of Lemma 1.

Lemma 4.

Suppose that A follows a mixture distribution described by [p_{1},A_{1};\ldots;p_{m},A_{m}]. Then for any distribution B, we have

\rho(A,B,s)\leq\sum_{i=1}^{m}p_{i}^{1-s}\rho(A_{i},B,s), (54)

and

\rho(B,A,s)\leq\sum_{i=1}^{m}p_{i}^{s}\rho(B,A_{i},s). (55)

The proof of this lemma is identical to that of Lemma 1, except that \sqrt{a+b}\leq\sqrt{a}+\sqrt{b} is replaced by the more general (a+b)^{s}\leq a^{s}+b^{s} for s\in[0,1] (and similarly for 1-s).

Decomposing each distribution P_{W|\Theta=1} and P_{W|\Theta=0} as a mixture based on the (finitely many) possible values of l, and applying Lemma 4 twice, we obtain

\rho_{W}(\bar{s})\leq\sum_{l_{0},l_{1}}\mathbb{P}(L_{0}=l_{0})^{1-\bar{s}}\mathbb{P}(L_{1}=l_{1})^{\bar{s}}\rho(P_{W|l_{0}},P_{W|l_{1}},\bar{s}) (56)
\leq\sum_{l_{0},l_{1}}\mathbb{P}(L_{0}\geq l_{0})^{1-\bar{s}}\mathbb{P}(L_{1}\leq l_{1})^{\bar{s}}\rho(P_{W|l_{0}},P_{W|l_{1}},\bar{s}). (57)

To help simplify this expression, we use the following consequence of (53) (again recalling \mu_{\max}<0):

\mathbb{P}(L_{0}\geq l_{0})^{1-\bar{s}}\mathbb{P}(L_{1}\leq l_{1})^{\bar{s}}\leq\exp(g(l_{0})\cdot\mu_{\max})\exp((k-g(l_{1}))\cdot\mu_{\max}). (58)

Suppose first that l_{0}\leq l_{1}. Since g is an increasing function and only the bits in between positions g(l_{0}) and g(l_{1}) differ, we have

\rho(P_{W|l_{0}},P_{W|l_{1}},\bar{s})=\exp(\mu_{Q}(\bar{s})\cdot(g(l_{1})-g(l_{0}))) (59)
\leq\exp(\mu_{\max}\cdot(g(l_{1})-g(l_{0}))), (60)

noting that each differing bit contributes e^{\mu_{Q}(\bar{s})}\leq e^{\mu_{\max}} to the total. Combining (58) and (60), the terms containing g(\cdot) cancel, and we are left with

\sum_{l_{0}\leq l_{1}}\mathbb{P}(L_{0}\geq l_{0})^{1-\bar{s}}\mathbb{P}(L_{1}\leq l_{1})^{\bar{s}}\rho(P_{W|l_{0}},P_{W|l_{1}},\bar{s})\leq\sum_{l_{0}\leq l_{1}}\exp(k\cdot\mu_{\max}). (61)

On the other hand, if l_{0}>l_{1}, then (58) and the increasing property of g give

\mathbb{P}(L_{0}\geq l_{0})^{1-\bar{s}}\mathbb{P}(L_{1}\leq l_{1})^{\bar{s}}\leq\exp(g(l_{0})\cdot\mu_{\max})\exp((k-g(l_{1}))\cdot\mu_{\max}) (62)
\leq\exp(k\cdot\mu_{\max}), (63)

and combining this with the fact that \rho\leq 1, we obtain

\sum_{l_{0}>l_{1}}\mathbb{P}(L_{0}\geq l_{0})^{1-\bar{s}}\mathbb{P}(L_{1}\leq l_{1})^{\bar{s}}\rho(P_{W|l_{0}},P_{W|l_{1}},\bar{s})\leq\sum_{l_{0}>l_{1}}\exp(k\cdot\mu_{\max}). (64)

By the definition of l(\vec{Y}) in (48), the value of l only depends on the number of occurrences of each output symbol in the block of length k, and not on their specific order. Hence, with a finite output alphabet \mathcal{Y}, the number of possible l values is upper bounded by (k+1)^{|\mathcal{Y}|} [11, Ch. 2], and we conclude from (57), (61) and (64) that

\mu_{W}(\bar{s})=\ln\rho_{W}(\bar{s})\leq k\cdot\mu_{\max}+\mathcal{O}(\log k). (65)

As with (15), this implies that the overall learning rate \mathcal{R} satisfies

\mathcal{R}\geq-\frac{1}{k}\mu_{W}(\bar{s})\geq-\mu_{\max}-\mathcal{O}\left(\frac{\log k}{k}\right). (66)

By setting k large enough, we can obtain any rate arbitrarily close to -\mu_{\max}, completing the proof of Theorem 2.

III-D A Converse Result and Its Consequences

As hinted following Theorem 2, by data processing arguments, we cannot obtain a learning rate greater than

-\max\Big(\min_{s\in[0,1]}\mu_{P}(s),\min_{s\in[0,1]}\mu_{Q}(s)\Big), (67)

and this matches Theorem 2 when P=Q. More generally, we can immediately deduce that Theorem 2 gives the optimal learning rate whenever Q is a degraded version of P (i.e., there exists an auxiliary channel U such that composing P with U gives an overall channel with the same transition law as Q), or vice versa. This is because the teacher (respectively, student) can artificially introduce extra noise to reduce to the case that P=Q.

In this section, we give an additional upper bound on the learning rate (i.e., converse) that improves on the simple one in (67), and establishes the tightness of Theorem 2 in certain cases beyond those mentioned above. To state this converse, we introduce an additional definition. Momentarily departing from the teaching/learning setup, consider the following two-round 1-hop communication scenario defined in terms of the DMCs P,Q and a parameter \gamma\in[0,1]:

  • The sender seeks to communicate one of two messages, \Theta\in\{0,1\}, each occurring with probability \frac{1}{2};

  • In the first round of communication, the sender transmits \gamma n symbols via independent uses of P without feedback;

  • After this first round, the sender observes the corresponding \gamma n output symbols via a noiseless feedback link;

  • In the second round of communication, the sender transmits (1-\gamma)n symbols via independent uses of Q, with no further feedback.

This is a somewhat unconventional setup with independent and non-identical channel uses, as well as partial feedback. We let E2(P,Q,γ)E_{2}(P,Q,\gamma) denote the best possible error exponent under this setup as nn\to\infty, with the subscript 2 representing its two-round nature.

Theorem 3.

Let P and Q be arbitrary binary-input discrete memoryless channels, and let \mathcal{R}^{*}(P,Q) be the supremum of learning rates across all teaching and learning protocols. Then, for any \gamma\in[0,1], we have

\mathcal{R}^{*}(P,Q)\leq E_{2}(P,Q,\gamma). (68)
Proof.

Consider a “genie-aided” setup in which the teacher and student are given the following additional information:

  • (i)

    The student can observe the input \hat{X} (rather than only the output Z) of the teacher-student channel Q from time 1 up to \gamma n.

  • (ii)

    The teacher is given the true value of \Theta after time \gamma n.

Since the teacher and student can always choose to ignore this additional information, any upper bound on the learning rate in this genie-aided setting is also an upper bound in the original setting.

The theorem follows by noting that this genie-aided setup can be viewed as an instance of the above two-round communication setup:

  • For the first γn\gamma n symbols, once the student is given X^\hat{X}, no further information is provided by ZZ (a “degraded” version of X^\hat{X}), and the student’s first γn\gamma n observed symbols correspond to repeatedly observing Θ\Theta through the channel PP without feedback.

  • Similarly, for the last (1γ)n(1-\gamma)n symbols, once the teacher is given Θ\Theta, no further information is provided by YY, but the value of Θ\Theta still needs to be conveyed to the student via the uses of QQ. Hence, the student’s final (1γ)n(1-\gamma)n observed symbols correspond to using QQ with no further feedback, but with the first γn\gamma n outputs being available at the sender (teacher).

In the remainder of this subsection, we connect the converse result of Theorem 3 with the achievability result of Theorem 2 in two different ways:

  • Under a certain technical assumption (Assumption 1 below), we show that there exists \gamma\in[0,1] such that the exponent in Theorem 2 equals E_{1}(P,Q,\gamma), which is defined in the same way as E_{2}(P,Q,\gamma) but with no feedback being available in the communication setup. Hence, the achievability and converse only differ with respect to the availability of feedback.

  • In the case that Q is a BSC, we show that Theorems 2 and 3 coincide for all P, thus establishing the optimal learning rate.

Starting with the former, we consider the following technical assumption.

Figure 4: Example comparisons of \mu_{P}(s) vs. \mu_{P}(1-s) for binary-input binary-output channels. We fix P(0|1)=0.2, and vary P(1|0). We observe that for every curve shown, it either holds that \mu_{P}(s)\leq\mu_{P}(1-s) for all s\in[0,\frac{1}{2}], or \mu_{P}(s)\geq\mu_{P}(1-s) for all s\in[0,\frac{1}{2}].
Assumption 1.

The DMCs PP and QQ satisfy the property that for all s[0,12]s\in\big{[}0,\frac{1}{2}\big{]}, it holds that μP(s)μP(1s)\mu_{P}(s)\leq\mu_{P}(1-s) and μQ(s)μQ(1s)\mu_{Q}(s)\leq\mu_{Q}(1-s).

This assumption states that values of s\in\big[0,\frac{1}{2}\big] are uniformly "better" (\mu is more negative) than their counterparts flipped about \frac{1}{2}. The opposite scenario (i.e., \mu(s)\geq\mu(1-s) for all s\in\big[0,\frac{1}{2}\big]) can be handled similarly, or can more simply be viewed as reducing to Assumption 1 upon swapping the two inputs. Up to such swapping, we were unable to find any binary-input binary-output channel for which Assumption 1 fails, and even in the multiple-output case, we found that it is difficult to find counter-examples. As two concrete examples, the BSC trivially satisfies \mu(s)=\mu(1-s), and the reverse Z-channel (i.e., the regular Z-channel with the inputs swapped) with parameter q\in(0,1) yields the increasing function \mu(s)=(1-s)\ln q. See Figure 4 for further illustrative numerical examples with binary output.

Lemma 5.

For any binary-input DMCs P and Q satisfying Assumption 1, we have

\min_{\gamma\in[0,1]}E_{1}(P,Q,\gamma)=-\min_{s\in[0,1]}\max\big(\mu_{P}(s),\mu_{Q}(s)\big), (69)

where E_{1}(P,Q,\gamma) is defined in the same way as E_{2}(P,Q,\gamma) above, but with no feedback being available in the communication setup.

Proof.

See Appendix -B. ∎

Observe that (69) precisely equals the learning rate derived in Theorem 2, and that we would match the converse in Theorem 3 if we could additionally show that E_{1}=E_{2}. Unfortunately, even the single round of feedback can strictly increase the error exponent (see Appendix -C for an example), meaning that the optimal learning rate remains unclear. On the positive side, it turns out that E_{1}=E_{2} whenever Q is a BSC, and in fact, in this case, we can establish that Theorem 2 gives the optimal learning rate even when Assumption 1 is dropped. Formally, we have the following.

Lemma 6.

For any binary-input DMC P, if Q is a BSC with crossover probability q\in\big(0,\frac{1}{2}\big), then the achievable learning rate in (45) is tight, i.e., \mathcal{R}^{*}(P,Q)=-\min_{s\in[0,1]}\max(\mu_{P}(s),\mu_{Q}(s)).

Proof.

See Appendix -D. ∎

Using Lemma 6 as a starting point, it is straightforward to establish that Theorem 3 can indeed strictly improve on the trivial converse in (67).

Corollary 2.

There exist binary-input DMCs PP and QQ such that the optimal learning rate is strictly smaller than max(mins[0,1]μP(s),mins[0,1]μQ(s))-\max\big{(}\min_{s\in[0,1]}\mu_{P}(s),\min_{s\in[0,1]}\mu_{Q}(s)\big{)}.

Proof.

This follows by letting PP be a (reverse) Z-channel and QQ be a BSC, with suitably-chosen parameters. See Appendix -E for the details. ∎

III-E A Converse for Block-Structured Protocols

While we have established several special cases where Theorem 2 gives the optimal learning rate, it remains open as to whether this is true in general. As a final result, we complement our general converse (Theorem 3) with a restricted converse that only applies to block-structured protocols. Specifically, we say that a teaching strategy is block structured with length k if it follows the structure in Figure 2: For any positive integer i, the bits transmitted at indices ki+1,\dotsc,k(i+1) are only allowed to depend on the bits received at indices k(i-1)+1,\dotsc,ki.

Theorem 4.

For any P and Q satisfying Assumption 1, and any protocol such that the teaching strategy is block structured with length k=o(n), the learning rate is at most -\min_{s\in[0,1]}\max(\mu_{P}(s),\mu_{Q}(s)).

Proof.

See Appendix -F. ∎

This result serves as evidence that Theorem 2 may provide the optimal learning rate under Assumption 1, or at least that to have any hope of improving on it, the teaching strategy should have some form of long-term memory.

IV Conclusion and Discussion

We have established optimal learning rates for the problem of teaching and learning a single bit under binary symmetric noise, using a simple and computationally efficient block-structured strategy. We have also adapted this technique to general binary-input DMCs.

As discussed above, while the optimal learning rate for general binary-input DMCs follows in several cases of interest, it remains unknown in general, leaving us with the following open problem.

Open Problem 1.

Does (46) give the optimal learning rate for arbitrary binary-input DMCs P and Q?

An important step towards answering this question would be removing Assumption 1 in the results that currently make use of it. Also regarding this assumption, we expect that it should hold true for all binary-input channels, and it would be of interest to establish whether this is indeed the case.

Another possible direction is to increase the input alphabet size of Q, removing the assumption of binary inputs. In the 1-hop case, having a channel with more inputs does not improve the learning rate beyond choosing the two best inputs for the channel [10]. By a similar idea, we can immediately deduce an achievability result via Theorem 2, but we are left with the following open problem.

Open Problem 2.

Does there exist a pair of DMCs P,QP,Q, such that every restriction of QQ to two inputs gives a strictly suboptimal learning rate?

In addition to these open problems, interesting directions may include the case that the channel transition laws are unknown to the teacher and/or student, and the case of teaching more than a single bit.

-A Proof of Lemma 3 (Technical Result for General Binary-Input DMCs)

We first state an auxiliary result from [10]. When the agent seeks to decide between the alternatives Θ=1\Theta=1 and Θ=0\Theta=0, the natural decision rule is to set a threshold on the corresponding log-likelihood ratio. The following result bounds the probability that this log-likelihood ratio test fails under a given threshold.

Lemma 7.

[10, Proof of Thm. 5] Let PP be a binary-input DMC, and for each output symbol yy, define

l(y)=lnP(y|1)P(y|0)l(y)=\ln\frac{P(y|1)}{P(y|0)} (70)

to be the log-likelihood ratio between the two conditional distributions. Let L1L_{1} and L0L_{0} follow the distribution of l(Y)l(Y) under YP(|1)Y\sim P(\cdot|1) and YP(|0)Y\sim P(\cdot|0) respectively. For all s(0,1)s\in(0,1), we have

(L0μP(s))exp[μP(s)sμP(s)],\mathbb{P}(L_{0}\geq\mu_{P}^{\prime}(s))\leq\exp[\mu_{P}(s)-s\mu_{P}^{\prime}(s)], (71)

and

(L1μP(s))exp[μP(s)+(1s)μP(s)].\mathbb{P}(L_{1}\leq\mu_{P}^{\prime}(s))\leq\exp[\mu_{P}(s)+(1-s)\mu_{P}^{\prime}(s)]. (72)

We now proceed with the proof of Lemma 3. First suppose that ll satisfies

lims0+μP(s)llims1μP(s).\lim_{s\rightarrow 0^{+}}\mu^{\prime}_{P}(s)\leq l\leq\lim_{s\rightarrow 1^{-}}\mu^{\prime}_{P}(s). (73)

Then, since μP(s)\mu^{\prime}_{P}(s) is continuous and non-decreasing [10], there exists t[0,1]t\in[0,1] such that l=μP(t)l=\mu_{P}^{\prime}(t), and by (71) and (72) (applied to the kk-fold product distribution), we obtain

ln(L0l)k(μP(t)tμP(t)),\ln\mathbb{P}(L_{0}\geq l)\leq k(\mu_{P}(t)-t\mu_{P}^{\prime}(t)), (74)

and

ln(L1l)k(μP(t)+(1t)μP(t)).\ln\mathbb{P}(L_{1}\leq l)\leq k(\mu_{P}(t)+(1-t)\mu_{P}^{\prime}(t)). (75)

On the other hand, if

l<lims0+μP(s),l<\lim_{s\rightarrow 0^{+}}\mu^{\prime}_{P}(s), (76)

then we may set t=0t=0, and (74) still holds due to the fact that μP(0)=0\mu_{P}(0)=0. In this case, we have

ln(L1l)ln(L1lims0+μP(s))(a)kμP(0),\ln\mathbb{P}(L_{1}\leq l)\leq\ln\mathbb{P}\Big{(}L_{1}\leq\lim_{s\rightarrow 0^{+}}\mu_{P}^{\prime}(s)\Big{)}\stackrel{{\scriptstyle(a)}}{{\leq}}k\mu_{P}^{\prime}(0), (77)

where (a) follows from fact that (75) holds when l=μP(t)l=\mu_{P}^{\prime}(t). It follows that (75) also holds here with t=0t=0. By a similar argument, if l>lims1μP(s)l>\lim_{s\rightarrow 1^{-}}\mu^{\prime}_{P}(s), then we may set t=1t=1, and (74) and (75) remain valid. Therefore, for every value of ll (possibly ±\pm\infty), there exists some t[0,1]t\in[0,1] such that equations (74) and (75) hold.

Since μP\mu_{P} is convex (see [10, Thm. 5]), μP(s¯)\mu_{P}(\bar{s}) is lower bounded by its tangent line approximation at tt. This implies

μP(t)+(s¯t)μP(t)μP(s¯)μmax,\mu_{P}(t)+(\bar{s}-t)\mu_{P}^{\prime}(t)\leq\mu_{P}(\bar{s})\leq\mu_{\max}, (78)

which re-arranges to give

(1s¯)(μP(t)tμP(t))μmaxs¯(μP(t)+(1t)μP(t)).(1-\bar{s})(\mu_{P}(t)-t\mu_{P}^{\prime}(t))\leq\mu_{\max}-\bar{s}(\mu_{P}(t)+(1-t)\mu_{P}^{\prime}(t)). (79)

Multiplying through by k>0k>0 and dividing through by μmax<0\mu_{\max}<0, we obtain

1s¯μmaxk(μP(t)tμP(t))ks¯μmaxk(μP(t)+(1t)μP(t)).\frac{1-\bar{s}}{\mu_{\max}}k(\mu_{P}(t)-t\mu_{P}^{\prime}(t))\geq k-\frac{\bar{s}}{\mu_{\max}}k(\mu_{P}(t)+(1-t)\mu_{P}^{\prime}(t)). (80)

Finally, further upper bounding the left-hand side via (74) and further lower bounding the right-hand side via (75), we deduce the desired inequality in (52).
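As a further optional sanity check on the intermediate steps (again an added illustration, assuming for concreteness that PP is a BSC so that the kk-letter log-likelihood ratio reduces to a scaled binomial), the sketch below computes the kk-fold tail probabilities exactly and confirms (74) and (75); the crossover probability, block length, and values of t are arbitrary.

```python
import math

p, k = 0.2, 20                        # arbitrary BSC crossover probability and block length
a = math.log((1 - p) / p)             # per-letter LLR values are +a (y=1) and -a (y=0)

def mu(s):
    # mu_P(s) for the BSC
    return math.log((1 - p) ** (1 - s) * p ** s + p ** (1 - s) * (1 - p) ** s)

def mu_prime(s, h=1e-6):
    return (mu(s + h) - mu(s - h)) / (2 * h)

def binom_range(n, q, lo, hi):
    # P(lo <= Bin(n, q) <= hi), computed exactly
    return sum(math.comb(n, m) * q ** m * (1 - q) ** (n - m)
               for m in range(max(lo, 0), min(hi, n) + 1))

for t in [0.2, 0.5, 0.8]:
    thr = k * mu_prime(t)             # threshold l = mu_P'(t) applied to the k-letter LLR
    # Under Theta = 0 the k-letter LLR equals (2M - k)a with M ~ Bin(k, p).
    P_L0_ge = binom_range(k, p, math.ceil((thr / a + k) / 2), k)
    # Under Theta = 1 we have M ~ Bin(k, 1 - p) and need the lower tail.
    P_L1_le = binom_range(k, 1 - p, 0, math.floor((thr / a + k) / 2))
    assert math.log(P_L0_ge) <= k * (mu(t) - t * mu_prime(t)) + 1e-6        # (74)
    assert math.log(P_L1_le) <= k * (mu(t) + (1 - t) * mu_prime(t)) + 1e-6  # (75)
    print(f"t={t}: bounds (74) and (75) verified")
```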

-B Proof of Lemma 5 (1-Hop Exponent Without Feedback)

Let x\vec{x} and x\vec{x}^{\prime} be the two codewords in the 1-hop communication problem without feedback, let y\vec{y} denote the received sequence, and define the nn-letter version of μ\mu as

μn,γ(s)=lny(Y=y|X=x)1s(Y=y|X=x)s.\mu_{n,\gamma}(s)=\ln\sum_{\vec{y}}\mathbb{P}(\vec{Y}=\vec{y}|\vec{X}=\vec{x}^{\prime})^{1-s}\mathbb{P}(\vec{Y}=\vec{y}\,|\,\vec{X}=\vec{x})^{s}. (81)

Using the classical analysis of Shannon, Gallager, and Berlekamp [10, Thm. 5] (see also Lemma 2 regarding the case P=QP=Q), the error exponent attained by optimal (maximum-likelihood) decoding is

limn1nmins[0,1]μn(s),\lim_{n\to\infty}-\frac{1}{n}\min_{s\in[0,1]}\mu_{n}(s), (82)

and accordingly, we proceed by analyzing μn(s)\mu_{n}(s). (The μn′′(s)\sqrt{\mu^{\prime\prime}_{n}(s)} term in [10, Thm. 5] scales as O(n)O(\sqrt{n}) due to the additive property of μ\mu for product distributions, and thus does not affect the error exponent.)

Without loss of optimality, we can assume that x\vec{x} and x\vec{x}^{\prime} differ in every entry. Let βP\beta_{P} be the fraction of uses of PP for which x\vec{x} takes value 11 (and hence x\vec{x}^{\prime} takes value 0), and define βQ\beta_{Q} analogously. Then, the additive property of μ\mu for product distributions (by taking the logarithm in (40)) gives

μn,γ(s)=βPγnμP(s)+(1βP)γnμP(1s)+βQ(1γ)nμQ(s)+(1βQ)(1γ)nμQ(1s),\mu_{n,\gamma}(s)=\beta_{P}\gamma n\mu_{P}(s)+(1-\beta_{P})\gamma n\mu_{P}(1-s)\\ +\beta_{Q}(1-\gamma)n\mu_{Q}(s)+(1-\beta_{Q})(1-\gamma)n\mu_{Q}(1-s), (83)

noting that by the definition of μP\mu_{P}, swapping the two inputs amounts to replacing ss by 1s1-s.

To further simplify (83), we define

μ=mins[0,1]max(μP(s),μQ(s)),\displaystyle\mu^{*}=\min_{s\in[0,1]}\,\max(\mu_{P}(s),\mu_{Q}(s)), (84)
s=argmins[0,1]max(μP(s),μQ(s)),\displaystyle s^{*}=\operatorname*{arg\,min}_{s\in[0,1]}\,\max(\mu_{P}(s),\mu_{Q}(s)), (85)

and introduce the shorthand

μ~n,γ(s)=γnμP(s)+(1γ)nμQ(s).\tilde{\mu}_{n,\gamma}(s)=\gamma n\mu_{P}(s)+(1-\gamma)n\mu_{Q}(s). (86)

We first provide a lemma lower bounding μ~n,γ(s)\tilde{\mu}_{n,\gamma}(s), and then show that μ~n,γ(s)\tilde{\mu}_{n,\gamma}(s) is itself a lower bound on μn,γ(s)\mu_{n,\gamma}(s).

Lemma 8.

There exists a choice of γ[0,1]\gamma\in[0,1] such that for all s[0,1]s\in[0,1], it holds that μ~n,γ(s)nμ\tilde{\mu}_{n,\gamma}(s)\geq n\mu^{*}.

Proof.

We split the proof into several cases. Throughout, it should be kept in mind that μP(s)\mu_{P}(s) is a continuous and convex function [10].

Case 1: If μP(s)μQ(s)\mu_{P}(s^{*})\geq\mu_{Q}(s^{*}) and μP\mu_{P} attains its global minimum at ss^{*} (here and subsequently, “global minimum” is meant with respect to the restricted domain s[0,1]s\in[0,1]), then we can set γ=1\gamma=1 to get

μ~n,γ(s)\displaystyle\tilde{\mu}_{n,\gamma}(s) =nμP(s)nμP(s)\displaystyle=n\mu_{P}(s)\geq n\mu_{P}(s^{*})
=nmax(μP(s),μQ(s))=nμ.\displaystyle=n\cdot\max(\mu_{P}(s^{*}),\mu_{Q}(s^{*}))=n\mu^{*}. (87)

A similar argument holds when μQ(s)μP(s)\mu_{Q}(s^{*})\geq\mu_{P}(s^{*}) and μQ\mu_{Q} attains its global minimum at ss^{*} (setting γ=0\gamma=0).

Case 2: If μP(s)>μQ(s)\mu_{P}(s^{*})>\mu_{Q}(s^{*}) and μP\mu_{P} does not attain its global minimum at ss^{*}, then we can shift ss^{*} slightly towards the minimizer of μP\mu_{P} to decrease max(μP(s),μQ(s))\max(\mu_{P}(s),\mu_{Q}(s)), contradicting the definition of ss^{*}. Hence, this case does not occur. A similar finding applies to the case that μQ(s)>μP(s)\mu_{Q}(s^{*})>\mu_{P}(s^{*}) and μQ\mu_{Q} does not attain its global minimum at ss^{*}.

Case 3: We claim that the above cases only leave the case that μP(s)=μQ(s)\mu_{P}(s^{*})=\mu_{Q}(s^{*}) and neither μP\mu_{P} nor μQ\mu_{Q} attains its global minimum at ss^{*}. To see this, note that if μP(s)μQ(s)\mu_{P}(s^{*})\neq\mu_{Q}(s^{*}), then the larger one attaining its global minimum would give Case 1, whereas the larger one not attaining its global minimum would give Case 2. Hence, we may assume that μP(s)=μQ(s)\mu_{P}(s^{*})=\mu_{Q}(s^{*}). Then, we can additionally assume that neither μP\mu_{P} nor μQ\mu_{Q} attains its global minimum at ss^{*}, since otherwise we would again be in Case 1. We proceed with two further sub-cases.

Case 3a: If μP(s)μQ(s)>0\mu^{\prime}_{P}(s^{*})\mu^{\prime}_{Q}(s^{*})>0, then we can shift ss^{*} by a small amount and decrease max(μP(s),μQ(s))\max(\mu_{P}(s),\mu_{Q}(s)), contradicting the fact that ss^{*} is the minimizer (unless we are at an endpoint of the interval [0,1][0,1], in which case one of μP\mu_{P} and μQ\mu_{Q} must attain its global minimum there, again a contradiction). Hence, this sub-case does not occur.

Case 3b: If μP(s)μQ(s)0\mu^{\prime}_{P}(s^{*})\mu^{\prime}_{Q}(s^{*})\leq 0, then we choose

γ=μQ(s)μQ(s)μP(s),1γ=μP(s)μP(s)μQ(s).\gamma=\frac{\mu^{\prime}_{Q}(s^{*})}{\mu^{\prime}_{Q}(s^{*})-\mu^{\prime}_{P}(s^{*})},\quad 1-\gamma=\frac{\mu^{\prime}_{P}(s^{*})}{\mu^{\prime}_{P}(s^{*})-\mu^{\prime}_{Q}(s^{*})}. (88)

The assumption μP(s)μQ(s)0\mu^{\prime}_{P}(s^{*})\mu^{\prime}_{Q}(s^{*})\leq 0 ensures that these quantities both lie in [0,1][0,1]; note also that the denominators in (88) are non-zero, since under this sign assumption μP(s)=μQ(s)\mu^{\prime}_{P}(s^{*})=\mu^{\prime}_{Q}(s^{*}) would force both derivatives to be zero, in which case both μP\mu_{P} and μQ\mu_{Q} would attain their global minimum at ss^{*}, contradicting the Case 3 assumption. Then, taking the derivative in (86), we obtain

μ~n,γ(s)=(γnμP(s)+(1γ)nμQ(s))|s=s=0.\tilde{\mu}^{\prime}_{n,\gamma}(s)=(\gamma n\mu^{\prime}_{P}(s)+(1-\gamma)n\mu^{\prime}_{Q}(s))|_{s=s^{*}}=0. (89)

Since μ~n,γ\tilde{\mu}_{n,\gamma} is convex (following from μP\mu_{P} and μQ\mu_{Q} being convex), this implies that μ~n,γ\tilde{\mu}_{n,\gamma} must attain its global minimum at s=ss=s^{*}. In addition, since μP(s)=μQ(s)\mu_{P}(s^{*})=\mu_{Q}(s^{*}), their common value must equal μ\mu^{*}, which further implies μ~n,γ(s)nμ\tilde{\mu}_{n,\gamma}(s)\geq n\cdot\mu^{*}. ∎
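To illustrate Lemma 8 concretely, the following sketch (an added numerical illustration; the two channels are arbitrary binary-input examples with no zero transition probabilities) computes s* and μ* on a grid, and checks that at least one of the candidate values of γ — the endpoints from Case 1 and the formula (88) from Case 3b — satisfies γμ_P(s) + (1−γ)μ_Q(s) ≥ μ* for all s on the grid (the factor n cancels).

```python
import numpy as np

def mu(C, s):
    # mu_C(s) = ln sum_y C(y|0)^{1-s} C(y|1)^s  (rows of C: C(.|0) and C(.|1))
    return np.log(np.sum(C[0] ** (1 - s) * C[1] ** s))

def mu_prime(C, s, h=1e-6):
    return (mu(C, s + h) - mu(C, s - h)) / (2 * h)

# Arbitrary example channels with no zero entries:
P = np.array([[0.8, 0.2], [0.2, 0.8]])   # BSC(0.2)
Q = np.array([[0.9, 0.1], [0.3, 0.7]])   # a binary asymmetric channel

grid = np.linspace(0.0, 1.0, 1001)
vals = np.array([max(mu(P, s), mu(Q, s)) for s in grid])
s_star, mu_star = grid[np.argmin(vals)], vals.min()   # (85) and (84) on the grid

# Candidate gammas: the endpoints (Case 1) and the Case 3b formula (88).
dP, dQ = mu_prime(P, s_star), mu_prime(Q, s_star)
candidates = [0.0, 1.0]
if dP != dQ:
    candidates.append(min(max(dQ / (dQ - dP), 0.0), 1.0))

def works(gamma):
    # Check gamma*mu_P(s) + (1-gamma)*mu_Q(s) >= mu* over the whole grid.
    return all(gamma * mu(P, s) + (1 - gamma) * mu(Q, s) >= mu_star - 1e-9 for s in grid)

valid = [g for g in candidates if works(g)]
print("s* =", s_star, " mu* =", mu_star, " valid gamma:", valid)
assert valid, "Lemma 8 guarantees at least one candidate succeeds"
```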

We now relate μn,γ(s)\mu_{n,\gamma}(s) to μ~n,γ(s)\tilde{\mu}_{n,\gamma}(s). Recall from Assumption 1 that μP(s)μP(1s)\mu_{P}(s)\leq\mu_{P}(1-s) and μQ(s)μQ(1s)\mu_{Q}(s)\leq\mu_{Q}(1-s) for s[0,12]s\in[0,\frac{1}{2}]. Accordingly, if s[0,12]s\in[0,\frac{1}{2}], then substituting these into (83) gives

μn,γ(s)μ~n,γ(s).\mu_{n,\gamma}(s)\geq\tilde{\mu}_{n,\gamma}(s). (90)

On the other hand, if s[12,1]s\in[\frac{1}{2},1], then Assumption 1 gives μP(1s)μP(s)\mu_{P}(1-s)\leq\mu_{P}(s) and μQ(1s)μQ(s)\mu_{Q}(1-s)\leq\mu_{Q}(s), and substitution into (83) gives

μn,γ(s)μ~n,γ(1s).\mu_{n,\gamma}(s)\geq\tilde{\mu}_{n,\gamma}(1-s). (91)

Combining these two cases, we have for all s[0,1]s\in[0,1] that μn,γ(s)μ~n,γ(s)\mu_{n,\gamma}(s)\geq\tilde{\mu}_{n,\gamma}(s^{\prime}) for one of the two choices s{s,1s}s^{\prime}\in\{s,1-s\}, and hence, Lemma 8 implies there exists γ[0,1]\gamma\in[0,1] such that

mins[0,1]μn,γ(s)nμ.\min_{s\in[0,1]}\mu_{n,\gamma}(s)\geq n\mu^{*}. (92)

In view of the fact that (82) is the optimal error exponent, we have shown that there always exists γ[0,1]\gamma\in[0,1] such that the error exponent is upper bounded by μ-\mu^{*}.

To complete the proof, we also need to show that for all γ[0,1]\gamma\in[0,1], there exist choices of the codewords and s[0,1]s\in[0,1] such that μn,γ(s)nμ\mu_{n,\gamma}(s)\leq n\mu^{*}. This is much simpler, following directly by setting βP=βQ=1\beta_{P}=\beta_{Q}=1 (i.e., choosing the all-zeros and all-ones codewords) in (83), setting s=ss=s^{*}, and upper bounding both μP(s)\mu_{P}(s^{*}) and μQ(s)\mu_{Q}(s^{*}) by μ\mu^{*}.

-C A Case Where Feedback Increases the 1-Hop Exponent

Let PP be a BSC with crossover probability p(0,12)p\in\big{(}0,\frac{1}{2}\big{)}, and let QQ be a reverse Z-channel (RZ-channel for short) with Q(1|1)=1Q(1|1)=1 and Q(1|0)=q(0,1)Q(1|0)=q\in(0,1). Note that Assumption 1 holds for these channels, as discussed just above Lemma 5. Consider the following communication strategy with partial feedback:

  • Repeatedly input Θ\Theta to the BSC in the first γn\gamma n channel uses.

  • After observing the feedback, if more than half the BSC bits are flipped, send the all-0s sequence over the RZ-channel in (1γ)n(1-\gamma)n uses, and otherwise send the all-11s sequence.

At the decoder, an initial estimate Θ~\tilde{\Theta} is formed based on maximum-likelihood decoding (i.e., a majority vote) on the γn\gamma n received BSC symbols. Then, after receiving the RZ-channel output, the decision is reversed (i.e., Θ^=1Θ~\hat{\Theta}=1-\tilde{\Theta}) if any 0s are observed among the (1γ)n(1-\gamma)n received symbols; otherwise, the decision is kept (i.e., Θ^=Θ~\hat{\Theta}=\tilde{\Theta}). Essentially, the RZ-channel uses serve to tell the decoder whether or not the initial decision should be changed.

If the initial decision Θ~\tilde{\Theta} is correct, then the final decision is correct with conditional probability one, since the all-11s sequence is received noiselessly. On the other hand, if the initial decision is incorrect, then the probability that it fails to be reversed is q(1γ)nq^{(1-\gamma)n}. Hence, an error only occurs if both steps fail, and the overall error exponent EE of this strategy is given by

E=γEBSC+(1γ)ln1q,E=\gamma E_{\rm BSC}+(1-\gamma)\ln\frac{1}{q}, (93)

where EBSC=D(12p)=12ln(14p(1p))E_{\rm BSC}=D\big{(}\frac{1}{2}\,\big{\|}p\big{)}=\frac{1}{2}\ln\big{(}\frac{1}{4p(1-p)}\big{)} is the optimal error exponent for transmitting two messages over a BSC (see (5)).

In Figure 5, we compare the exponent in (93) to that without feedback (Lemma 5), setting p=0.2p=0.2 and q=0.8q=0.8, which turns out to make the optimal exponents of PP and QQ identical. In this example, we see that feedback strictly increases the exponent for all γ(0,1)\gamma\in(0,1). Intuitively, the improvement comes from being able to “reverse” initial “bad events” to turn them into good events, which is not possible in the absence of feedback.

Figure 5: Comparison of the error exponent under the two-round coding scheme with feedback vs. the error exponent without feedback, when PP is a BSC with parameter p=0.2p=0.2, and QQ is a reverse Z-channel with parameter q=0.8q=0.8.
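The comparison behind Figure 5 can be reproduced numerically. The following is an illustrative sketch (not part of the analysis above): it evaluates the feedback exponent (93), and, for the no-feedback curve, uses the exponent in the form −min_s[γμ_P(s)+(1−γ)μ_Q(s)] suggested by (86) and the surrounding proof; this reading of Lemma 5, along with the grid resolution, is an assumption on our part.

```python
import numpy as np

def mu(C, s):
    # mu_C(s) = ln sum_y C(y|0)^{1-s} C(y|1)^s  (rows of C: C(.|0) and C(.|1))
    return np.log(np.sum(C[0] ** (1 - s) * C[1] ** s))

p, q = 0.2, 0.8
P = np.array([[1 - p, p], [p, 1 - p]])   # BSC(p)
Q = np.array([[1 - q, q], [0.0, 1.0]])   # reverse Z-channel: Q(1|1) = 1, Q(1|0) = q

E_bsc = 0.5 * np.log(1.0 / (4 * p * (1 - p)))   # D(1/2 || p), approx 0.2231
E_rz = np.log(1.0 / q)                          # exponent of the RZ-channel alone, also approx 0.2231

s_grid = np.linspace(0, 1, 2001)
for gamma in np.linspace(0, 1, 11):
    E_fb = gamma * E_bsc + (1 - gamma) * E_rz   # feedback scheme, Eq. (93)
    E_nofb = -min(gamma * mu(P, s) + (1 - gamma) * mu(Q, s) for s in s_grid)
    print(f"gamma={gamma:.1f}: with feedback {E_fb:.4f}, without feedback {E_nofb:.4f}")
```

For p=0.2 and q=0.8 the two endpoint exponents coincide (both equal ln(1/0.8) ≈ 0.2231), so the feedback exponent is constant in γ, while the no-feedback values dip below it for every interior γ, consistent with the discussion above.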

-D Proof of Lemma 6 (Tightness for Symmetric QQ)

Throughout the proof, we let (y|0)\mathbb{P}(\vec{y}|0) be a shorthand for (Y=y|Θ=0)\mathbb{P}(\vec{Y}=\vec{y}|\Theta=0), and similarly for (y|1)\mathbb{P}(\vec{y}|1) and other analogous quantities.

We make use of Theorem 3, and accordingly consider a (feedback) communication scenario with γn\gamma n uses of PP and (1γ)n(1-\gamma)n uses of QQ. Since we assume Θ\Theta to be equiprobable on {0,1}\{0,1\}, the optimal decoding rule given a received sequence y\vec{y} is to set Θ^=1\hat{\Theta}=1 if (Y=y|Θ=1)>(Y=y|Θ=0)\mathbb{P}(\vec{Y}=\vec{y}|\Theta=1)>\mathbb{P}(\vec{Y}=\vec{y}|\Theta=0), and Θ^=0\hat{\Theta}=0 otherwise. Under this optimal rule, the error probability Pe=(Θ^Θ)P_{\rm e}=\mathbb{P}(\hat{\Theta}\neq\Theta) is given by [10, Eq. (3.2)]

Pe=12ymin((y|0),(y|1)).\displaystyle P_{\rm e}=\frac{1}{2}\sum_{\vec{y}}\min(\mathbb{P}(\vec{y}|0),\mathbb{P}(\vec{y}|1)). (94)

We momentarily consider the case that there is no feedback, in which case we can simply define x\vec{x} and x\vec{x}^{\prime} to be the two length-nn codewords (one per message), and choose these codewords to minimize the error probability, yielding

Peno-fb=12minx,xymin((y|x),(y|x)).\displaystyle P_{\rm e}^{\text{no-fb}}=\frac{1}{2}\min_{\vec{x},\vec{x}^{\prime}}\sum_{\vec{y}}\min(\mathbb{P}(\vec{y}|\vec{x}),\mathbb{P}(\vec{y}|\vec{x}^{\prime})). (95)

Since this expression is still exactly equal to the smallest possible error probability, its corresponding exponent is E1(P,Q,γ)E_{1}(P,Q,\gamma), defined analogously to E2(P,Q,γ)E_{2}(P,Q,\gamma) but without feedback.

In the two-round setting with feedback considered in Theorem 3, slightly more care is needed. We proceed by explicitly splitting up the codeword and received sequence as x=(xP,xQ)\vec{x}=(\vec{x}_{P},\vec{x}_{Q}) and y=(yP,yQ)\vec{y}=(\vec{y}_{P},\vec{y}_{Q}), and writing the following for θ{0,1}\theta\in\{0,1\}:

(y|θ)=Pγn(yP|xP(θ))Q(1γ)n(yQ|xQ(yP,θ)),\mathbb{P}(\vec{y}|\theta)=P^{\gamma n}(\vec{y}_{P}|\vec{x}_{P}(\theta))Q^{(1-\gamma)n}(\vec{y}_{Q}|\vec{x}_{Q}(\vec{y}_{P},\theta)), (96)

where xQ(yP,θ)\vec{x}_{Q}(\vec{y}_{P},\theta) explicitly writes xQ\vec{x}_{Q} as a function of yP\vec{y}_{P} and θ\theta, and similarly for xP=xP(θ)\vec{x}_{P}=\vec{x}_{P}(\theta). Substituting into (94), we obtain

Pe=12yP,yQmin(Pγn(yP|xP(0))Q(1γ)n(yQ|xQ(yP,0)),\displaystyle P_{\rm e}=\frac{1}{2}\sum_{\vec{y}_{P},\vec{y}_{Q}}\min\big{(}P^{\gamma n}(\vec{y}_{P}|\vec{x}_{P}(0))Q^{(1-\gamma)n}(\vec{y}_{Q}|\vec{x}_{Q}(\vec{y}_{P},0)),
Pγn(yP|xP(1))Q(1γ)n(yQ|xQ(yP,1)))\displaystyle\qquad\qquad P^{\gamma n}(\vec{y}_{P}|\vec{x}_{P}(1))Q^{(1-\gamma)n}(\vec{y}_{Q}|\vec{x}_{Q}(\vec{y}_{P},1))\big{)} (97)
12minxP,xPyPminxQ,xQyQmin(Pγn(yP|xP)Q(1γ)n(yQ|xQ),\displaystyle\geq\frac{1}{2}\min_{\vec{x}_{P},\vec{x}^{\prime}_{P}}\sum_{\vec{y}_{P}}\min_{\vec{x}_{Q},\vec{x}^{\prime}_{Q}}\sum_{\vec{y}_{Q}}\min\big{(}P^{\gamma n}(\vec{y}_{P}|\vec{x}_{P})Q^{(1-\gamma)n}(\vec{y}_{Q}|\vec{x}_{Q}),
Pγn(yP|xP)Q(1γ)n(yQ|xQ)).\displaystyle\qquad\qquad P^{\gamma n}(\vec{y}_{P}|\vec{x}^{\prime}_{P})Q^{(1-\gamma)n}(\vec{y}_{Q}|\vec{x}^{\prime}_{Q})\big{)}. (98)

We now observe that this expression coincides with the non-feedback case given in (95), except that the minimum over (xQ,xQ)(\vec{x}_{Q},\vec{x}^{\prime}_{Q}) comes after the summation over yP\vec{y}_{P} (since the former is allowed to depend on the latter).

While this ordering can be significant in general, we now show that it has no effect when QQ is a BSC. Without loss of optimality, we can assume that xQ\vec{x}_{Q} and xQ\vec{x}^{\prime}_{Q} differ in every entry, since any entries that coincide would be independent of Θ\Theta (at least conditionally given yP\vec{y}_{P}) and reveal no information (or viewed differently, could be simulated at the decoder). However, when QQ is a BSC, its symmetry implies that any quantity of the form

yQmin(aQ(1γ)n(yQ|xQ),bQ(1γ)n(yQ|xQ))\sum_{\vec{y}_{Q}}\min(aQ^{(1-\gamma)n}(\vec{y}_{Q}|\vec{x}_{Q}),bQ^{(1-\gamma)n}(\vec{y}_{Q}|\vec{x}^{\prime}_{Q})) (99)

(with constants aa and bb) is identical for all such (xQ,xQ)(\vec{x}_{Q},\vec{x}^{\prime}_{Q}) pairs; the precise choice only amounts to re-ordering terms in the summation over yQ\vec{y}_{Q}. It follows that interchanging the order of yP\sum_{\vec{y}_{P}} and minxQ,xQ\min_{\vec{x}_{Q},\vec{x}^{\prime}_{Q}} in (98) has no impact, and we recover the non-feedback expression (95), whose error exponent is defined to be E1(P,Q,γ)E_{1}(P,Q,\gamma).

It only remains to show that the error exponent E1(P,Q,γ)E_{1}(P,Q,\gamma) is no higher than the learning rate in (45). To show this, we re-use some of the findings from Appendix -B (note that Assumption 1 was not applied until after Lemma 8 in Appendix -B, and we will not use it here). In particular, the error exponent is still given by (82), and μn,γ(s)\mu_{n,\gamma}(s) still satisfies (83), assuming without loss of optimality that the two codewords x\vec{x} and x\vec{x}^{\prime} differ in every entry.

For any fixed ss, we can make μn,γ(s)\mu_{n,\gamma}(s) in (83) no larger by choosing both βP\beta_{P} and βQ\beta_{Q} to be either zero or one (e.g., if μP(s)<μP(1s)\mu_{P}(s)<\mu_{P}(1-s) then we set βP=1\beta_{P}=1). That is, without loss of optimality, every input to PP is identical, and every input to QQ is identical. Assuming without loss of generality that βP=1\beta_{P}=1, and noting that the symmetry of the BSC implies μQ(s)=μQ(1s)\mu_{Q}(s)=\mu_{Q}(1-s), it follows that we can simplify (83) to

μn,γ(s)=γnμP(s)+(1γ)nμQ(s).\mu_{n,\gamma}(s)=\gamma n\mu_{P}(s)+(1-\gamma)n\mu_{Q}(s). (100)

This precisely coincides with the quantity μ~n,γ(s)\tilde{\mu}_{n,\gamma}(s) introduced in (86), and hence, we can directly apply Lemma 8 to obtain μn,γ(s)nμ\mu_{n,\gamma}(s)\geq n\mu^{*} for some γ[0,1]\gamma\in[0,1] and all s[0,1]s\in[0,1], where μ\mu^{*} is defined in (84). Since μ\mu^{*} coincides with (45), this completes the proof.

-E Proof of Corollary 2 (Strict Improvement in the Converse)

To see that the converse in Theorem 3 can strictly improve on the trivial one in (67), we re-use the example shown in Figure 5 in Appendix -C, but with PP and QQ interchanged (i.e., PP is a reverse Z-channel with parameter 0.80.8, and QQ is a BSC with parameter 0.20.2). In this case, it is straightforward to verify that the “No Feedback” curve in Figure 5 is simply flipped with respect to the mid-point γ=12\gamma=\frac{1}{2}, and in particular, all values of γ(0,1)\gamma\in(0,1) still give a strictly smaller exponent than the endpoints γ=0\gamma=0 and γ=1\gamma=1.

The key difference due to interchanging PP and QQ is that in view of the above proof of Lemma 6, the feedback no longer helps, and hence, the same curve just discussed serves as a valid converse even with the single round of feedback. Hence, by Theorem 3 and the fact that the simple converse in (67) corresponds to taking the better of γ=0\gamma=0 and γ=1\gamma=1, we conclude that a strict improvement has indeed been shown.

While the reliance of the preceding proof on a numerical evaluation can be circumvented, we omit such an approach, since we believe that this example has no ambiguity with respect to potential numerical issues.
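For completeness, the numerical evaluation referred to above can be reproduced with the following sketch (an added illustration, again adopting the no-feedback exponent form −min_s[γμ_P(s)+(1−γ)μ_Q(s)] used in the preceding appendices, which is our reading of Lemma 5); it checks on a grid that every interior γ gives a strictly smaller exponent than the better of the two endpoints.

```python
import numpy as np

def mu(C, s):
    # mu_C(s) = ln sum_y C(y|0)^{1-s} C(y|1)^s  (rows of C: C(.|0) and C(.|1))
    return np.log(np.sum(C[0] ** (1 - s) * C[1] ** s))

# P and Q interchanged relative to Appendix -C:
P = np.array([[0.2, 0.8], [0.0, 1.0]])   # reverse Z-channel with parameter 0.8
Q = np.array([[0.8, 0.2], [0.2, 0.8]])   # BSC with parameter 0.2

s_grid = np.linspace(0, 1, 2001)

def E1(gamma):
    # No-feedback exponent in the form used in the proofs above.
    return -min(gamma * mu(P, s) + (1 - gamma) * mu(Q, s) for s in s_grid)

endpoints = min(E1(0.0), E1(1.0))                      # better of gamma in {0,1}, cf. (67)
interior = [E1(g) for g in np.linspace(0.05, 0.95, 19)]
print(f"endpoints: {endpoints:.4f}, largest interior value: {max(interior):.4f}")
assert max(interior) < endpoints - 1e-3                # strict improvement on the grid
```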

-F Proof of Theorem 4 (Converse for General Block-Structured Protocols)

We first consider a single block. Letting YY denote the kk symbols received by the teacher, X^\hat{X} denote the kk transmitted bits, and WW denote the block received by the student (here we omit the vector notation ()\vec{(\cdot)} to lighten the exposition), we have the Markov chain ΘYX^W\Theta\rightarrow Y\rightarrow\hat{X}\rightarrow W.

We begin with an inequality on mixture distributions (cf. Definition 1). Let AA and BB be probability distributions, and suppose AA follows a mixture distribution described by [p1,A1;;pm,Am][p_{1},A_{1};\ldots;p_{m},A_{m}]. Then, observe that

ρ(A,B,s)\displaystyle\rho(A,B,s) =x(i=1mpi(Ai=x))1s(B=x)s\displaystyle=\sum_{x}\bigg{(}\sum_{i=1}^{m}p_{i}\mathbb{P}(A_{i}=x)\bigg{)}^{1-s}\mathbb{P}(B=x)^{s} (101)
i=1mpix(Ai=x)1s(B=x)s\displaystyle\geq\sum_{i=1}^{m}p_{i}\sum_{x}\mathbb{P}(A_{i}=x)^{1-s}\mathbb{P}(B=x)^{s} (102)
mini=1,,mρ(Ai,B,s),\displaystyle\geq\min_{i=1,\dotsc,m}\rho(A_{i},B,s), (103)

where (102) applies Jensen’s inequality to the concave function f(z)=z1sf(z)=z^{1-s} (s[0,1]s\in[0,1]), and (103) lower bounds the average by the minimum. The same argument applies when AA is fixed and BB is a mixture.
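Since the chain (101)–(103) is used repeatedly, we note that it is straightforward to sanity-check numerically; the sketch below (an added illustration, with randomly generated distributions and an arbitrary alphabet size) verifies ρ(A,B,s) ≥ min_i ρ(A_i,B,s) for random mixtures.

```python
import numpy as np

rng = np.random.default_rng(0)

def rho(A, B, s):
    # rho(A, B, s) = sum_x A(x)^{1-s} B(x)^s
    return np.sum(A ** (1 - s) * B ** s)

def random_dist(size):
    v = rng.random(size)
    return v / v.sum()

for _ in range(1000):
    m, alphabet, s = 4, 6, rng.random()
    components = [random_dist(alphabet) for _ in range(m)]     # A_1, ..., A_m
    weights = random_dist(m)                                   # p_1, ..., p_m
    A = sum(w * c for w, c in zip(weights, components))        # the mixture A
    B = random_dist(alphabet)
    assert rho(A, B, s) >= min(rho(c, B, s) for c in components) - 1e-12
print("mixture inequality (103) verified on random instances")
```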

For convenience, define μ(A,B,s)=lnρ(A,B,s)\mu(A,B,s)=\ln\rho(A,B,s). Considering two length-kk sequences x,x\vec{x},\vec{x}^{\prime}, by the additive property of μ\mu (i.e., multiplicative property of ρ\rho) for product distributions, the corresponding conditional distributions for WW satisfy

μ(PW|X^=x,PW|X^=x,s)k1μQ(s)+k2μQ(1s),\mu(P_{W|\hat{X}=\vec{x}},P_{W|\hat{X}=\vec{x}^{\prime}},s)\geq k_{1}\mu_{Q}(s)+k_{2}\mu_{Q}(1-s), (104)

where k1k_{1} is the number of indices where x\vec{x} is zero and x\vec{x}^{\prime} is one, and vice versa for k2k_{2}.

Supposing first that s[0,12]s\in[0,\frac{1}{2}], we can use the assumption μQ(s)μQ(1s)\mu_{Q}(s)\leq\mu_{Q}(1-s) from Assumption 1, and weaken (104) to

μ(PW|X^=x,PW|X^=x,s)kμQ(s)\mu(P_{W|\hat{X}=\vec{x}},P_{W|\hat{X}=\vec{x}^{\prime}},s)\geq k\mu_{Q}(s) (105)

since k1+k2kk_{1}+k_{2}\leq k and μQ(s)0\mu_{Q}(s)\leq 0. Observe that (105) gives a lower bound that holds uniformly with respect to x,x\vec{x},\vec{x}^{\prime}. Then, using the property (103) and the fact that PW|Θ=0P_{W|\Theta=0} and PW|Θ=1P_{W|\Theta=1} are mixture distributions mixed by X^\hat{X}, we see that (105) implies (for s[0,12]s\in[0,\frac{1}{2}]) that

μ(PW|Θ=0,PW|Θ=1,s)kμQ(s).\mu(P_{W|\Theta=0},P_{W|\Theta=1},s)\geq k\mu_{Q}(s). (106)

This gives us a QQ-dependent lower bound, and the PP-dependent lower bound turns out to be simpler: Using the Markov chain ΘYX^W\Theta\rightarrow Y\rightarrow\hat{X}\rightarrow W, along with the data processing inequality for μ(,,s)\mu(\cdot,\cdot,s) (this is equivalent to the data processing inequality for Rényi divergence, defined as Dα(PQ)=1α1lnxP(x)αQ(x)1αD_{\alpha}(P\|Q)=\frac{1}{\alpha-1}\ln\sum_{x}P(x)^{\alpha}Q(x)^{1-\alpha}; see for example [12, Example 2]), we have

μ(PW|Θ=0,PW|Θ=1,s)μ(PY|Θ=0,PY|Θ=1,s)=kμP(s).\mu(P_{W|\Theta=0},P_{W|\Theta=1},s)\geq\mu(P_{Y|\Theta=0},P_{Y|\Theta=1},s)=k\mu_{P}(s). (107)

Combining the lower bounds, we have for all s[0,12]s\in[0,\frac{1}{2}] that

μ(PW|Θ=0,PW|Θ=1,s)kmax(μP(s),μQ(s)).\mu(P_{W|\Theta=0},P_{W|\Theta=1},s)\geq k\max(\mu_{P}(s),\mu_{Q}(s)). (108)

If s[12,1]s\in[\frac{1}{2},1], then Assumption 1 gives μQ(s)μQ(1s)\mu_{Q}(s)\geq\mu_{Q}(1-s), and the analog of (106) is

μ(PW|Θ=0,PW|Θ=1,s)kμQ(1s).\mu(P_{W|\Theta=0},P_{W|\Theta=1},s)\geq k\mu_{Q}(1-s). (109)

Since (107) holds for all s[0,1]s\in[0,1], and Assumption 1 similarly gives μP(s)μP(1s)\mu_{P}(s)\geq\mu_{P}(1-s) for s[12,1]s\in[\frac{1}{2},1], it follows that for any s[12,1]s\in[\frac{1}{2},1], we have

μ(PW|Θ=0,PW|Θ=1,s)kmax(μP(1s),μQ(1s)).\mu(P_{W|\Theta=0},P_{W|\Theta=1},s)\geq k\max(\mu_{P}(1-s),\mu_{Q}(1-s)). (110)

Combining the cases s[0,12]s\in[0,\frac{1}{2}] and s[12,1]s\in[\frac{1}{2},1], we deduce that

mins[0,1]μ(PW|Θ=0,PW|Θ=1,s)mins[0,12]kmax(μP(s),μQ(s)).\min_{s\in[0,1]}\mu(P_{W|\Theta=0},P_{W|\Theta=1},s)\geq\min_{s\in[0,\frac{1}{2}]}k\max(\mu_{P}(s),\mu_{Q}(s)). (111)

Having studied a single block, we are now in a position to study the overall combination of nk1\frac{n}{k}-1 blocks (the first block received by the student does not depend on Θ\Theta, and provides no distinguishing power), defining the nn-letter version of μ\mu as

μn(s)=lnz(Z=z|Θ=0)1s(Z=z|Θ=1)s,\mu_{n}(s)=\ln\sum_{\vec{z}}\mathbb{P}(\vec{Z}=\vec{z}\,|\,\Theta=0)^{1-s}\mathbb{P}(\vec{Z}=\vec{z}\,|\,\Theta=1)^{s}, (112)

where Z\vec{Z} is the entire length-nn sequence received by the student. Assuming momentarily that the teacher employs an identical strategy within each block, the additive property of μ\mu for product distributions (arising from PP and QQ being memoryless) gives

μn(s)=(nk1)μ(PW|Θ=0,PW|Θ=1,s),\mu_{n}(s)=\left(\frac{n}{k}-1\right)\mu(P_{W|\Theta=0},P_{W|\Theta=1},s), (113)

and substituting (111) gives

mins[0,1]μn(s)mins[0,12](nk)max(μP(s),μQ(s)).\min_{s\in[0,1]}\mu_{n}(s)\geq\min_{s\in[0,\frac{1}{2}]}(n-k)\max(\mu_{P}(s),\mu_{Q}(s)). (114)

In fact, this lower bound holds even if the teacher applies a different strategy between blocks, since we showed that (111) holds regardless of the strategy used within the block.

Letting s=argmins[0,1]μn(s)s^{*}=\operatorname*{arg\,min}_{s\in[0,1]}\mu_{n}(s), we have from [10, Thm. 5] (a more general form of Lemma 2) that, for any decoding rule employed by the student, the error probability is lower bounded by

(Θ^nΘ)18exp(μn(s)2μn′′(s)),\mathbb{P}(\hat{\Theta}_{n}\neq\Theta)\geq\frac{1}{8}\exp(\mu_{n}(s^{*})-\sqrt{2\mu^{\prime\prime}_{n}(s^{*})}), (115)

where μn′′(s)\mu^{\prime\prime}_{n}(s) denotes the second derivative when s(0,1)s\in(0,1), and the appropriate limiting value when s{0,1}s\in\{0,1\}. Again using the additive property of μ\mu, the following holds if the teacher employs the same per-block strategy in each block:

μn′′(s)=(nk1)μ′′(PW|Θ=0,PW|Θ=1,s).\mu_{n}^{\prime\prime}(s)=\Big{(}\frac{n}{k}-1\Big{)}\mu^{\prime\prime}(P_{W|\Theta=0},P_{W|\Theta=1},s). (116)

Once again, our analysis also extends immediately to the case of varying per-block strategies.

In Lemma 9 below, we show that μ′′(PW|Θ=0,PW|Θ=1,s)=O(k2)\mu^{\prime\prime}(P_{W|\Theta=0},P_{W|\Theta=1},s)=O(k^{2}), and substitution into (116) gives μn′′(s)=O(nk)\mu^{\prime\prime}_{n}(s)=O(nk). The assumption k=o(n)k=o(n) then yields μn′′(s)=o(n2)\mu^{\prime\prime}_{n}(s)=o(n^{2}), or μn′′(s)=o(n)\sqrt{\mu^{\prime\prime}_{n}(s)}=o(n). Substituting this scaling into (115), applying (114), and taking suitable limits, we obtain

lim supn1nln(Θ^nΘ)mins[0,12]max(μP(s),μQ(s)).\limsup_{n\rightarrow\infty}\,-\frac{1}{n}\ln\mathbb{P}(\hat{\Theta}_{n}\neq\Theta)\leq-\min_{s\in[0,\frac{1}{2}]}\max(\mu_{P}(s),\mu_{Q}(s)). (117)

We deduce Theorem 4 by further upper bounding the right-hand side by expanding the minimum from [0,12][0,\frac{1}{2}] to [0,1][0,1]; (117) additionally reveals that further restricting s[0,12]s\in[0,\frac{1}{2}] is without loss of optimality, which is unsurprising given Assumption 1.

It only remains to show the following.

Lemma 9.

For any fixed PP and QQ, and any s[0,1]s\in[0,1], it holds that μ′′(PW|Θ=0,PW|Θ=1,s)=O(k2)\mu^{\prime\prime}(P_{W|\Theta=0},P_{W|\Theta=1},s)=O(k^{2}).

Proof.

To reduce notation, define P0=PW|Θ=0P_{0}=P_{W|\Theta=0} and P1=PW|Θ=1P_{1}=P_{W|\Theta=1}. It is shown in [10, p. 85] that μ′′\mu^{\prime\prime} can be written as the variance of a log-likelihood ratio:

μ′′(P0,P1,s)=Var[logP1(W~)P0(W~)],\mu^{\prime\prime}(P_{0},P_{1},s)={\rm Var}\bigg{[}\log\frac{P_{1}(\tilde{W})}{P_{0}(\tilde{W})}\bigg{]}, (118)

where W~\tilde{W} is distributed according to the following “tilted” distribution when s(0,1)s\in(0,1):

P~(w)=P0(w)1sP1(w)swP0(w)1sP1(w)s.\tilde{P}(\vec{w})=\frac{P_{0}(\vec{w})^{1-s}P_{1}(\vec{w})^{s}}{\sum_{\vec{w}^{\prime}}P_{0}(\vec{w}^{\prime})^{1-s}P_{1}(\vec{w}^{\prime})^{s}}. (119)

In addition, when ss equals an endpoint (i.e., 0 or 1), we can use the same expression for P~\tilde{P}, except that w\vec{w} must be restricted to satisfy both P0(w)>0P_{0}(\vec{w})>0 and P1(w)>0P_{1}(\vec{w})>0 (with P~(w)=0\tilde{P}(\vec{w})=0 otherwise) [10].

We upper bound the variance by the second moment, i.e., 𝔼[(logP1(W~)P0(W~))2]\mathbb{E}\big{[}\big{(}\log\frac{P_{1}(\tilde{W})}{P_{0}(\tilde{W})}\big{)}^{2}\big{]}, and proceed by further upper bounding the latter. By the definition of P~\tilde{P}, any w\vec{w} such that P0(w)=0P_{0}(\vec{w})=0 or P1(w)=0P_{1}(\vec{w})=0 does not contribute to the second moment. On the other hand, for θ{0,1}\theta\in\{0,1\}, if Pθ(w)0P_{\theta}(\vec{w})\neq 0, then we have

Pθ(w)\displaystyle P_{\theta}(\vec{w}) =yPk(y|θ)Qk(w|x^(y))\displaystyle=\sum_{\vec{y}}P^{k}(\vec{y}|\theta)Q^{k}(\vec{w}|\vec{\hat{x}}(\vec{y})) (120)
PminkQmink,\displaystyle\geq P_{\min}^{k}Q_{\min}^{k}, (121)

where PkP^{k} is the kk-fold product of PP (and similarly for QkQ^{k}), x^(y)\vec{\hat{x}}(\vec{y}) is the transmitted X^\hat{X} sequence when the teacher receives y\vec{y}, and Pmin,QminP_{\min},Q_{\min} are the smallest non-zero transition probabilities of PP and QQ. It follows that each non-zero Pθ(w)P_{\theta}(\vec{w}) is bounded between PminkQminkP_{\min}^{k}Q_{\min}^{k} and 11, which implies that logP1(w)P0(w)=O(k)\log\frac{P_{1}(\vec{w})}{P_{0}(\vec{w})}=O(k), and hence 𝔼[(logP1(W~)P0(W~))2]=O(k2)\mathbb{E}\big{[}\big{(}\log\frac{P_{1}(\tilde{W})}{P_{0}(\tilde{W})}\big{)}^{2}\big{]}=O(k^{2}).

Acknowledgment

This work was supported by the Singapore National Research Foundation (NRF) under grant number R-252-000-A74-281.

References

  • [1] V. Jog and P. L. Loh, “Teaching and learning in uncertainty,” IEEE Transactions on Information Theory, vol. 67, no. 1, pp. 598–615, 2021.
  • [2] W. Huleihel, Y. Polyanskiy, and O. Shayevitz, “Relaying one bit across a tandem of binary-symmetric channels,” IEEE International Symposium on Information Theory (ISIT), 2019.
  • [3] S. Rajagopalan and L. Schulman, “A coding theorem for distributed computation,” ACM Symposium on Theory of Computing, 1994.
  • [4] X. Vives, “How fast do rational agents learn?” The Review of Economic Studies, vol. 60, no. 2, pp. 329–347, 1993.
  • [5] A. Jadbabaie, P. Molavi, and A. Tahbaz-Salehi, “Information heterogeneity and the speed of learning in social networks,” Columbia Business School Research Paper, no. 13-28, 2013.
  • [6] P. Molavi, A. Tahbaz-Salehi, and A. Jadbabaie, “Foundations of non-bayesian social learning,” Columbia Business School Research Paper, no. 15-95, 2017.
  • [7] M. Harel, E. Mossel, P. Strack, and O. Tamuz, “Rational Groupthink,” The Quarterly Journal of Economics, vol. 136, no. 1, pp. 621–668, 07 2020.
  • [8] A. El Gamal and Y.-H. Kim, Network information theory.   Cambridge university press, 2011.
  • [9] S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence.   OUP Oxford, 2013.
  • [10] C. Shannon, R. Gallager, and E. Berlekamp, “Lower bounds to error probability for coding on discrete memoryless channels. I,” Information and Control, vol. 10, no. 1, pp. 65–103, 1967.
  • [11] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems, 2nd ed.   Cambridge University Press, 2011.
  • [12] T. van Erven and P. Harremos, “Rényi divergence and Kullback-Leibler divergence,” IEEE Transactions on Information Theory, vol. 60, no. 7, pp. 3797–3820, 2014.
Yan Hao Ling received the B.Comp. degree in computer science and the B.Sci. degree in mathematics from the National University of Singapore (NUS) in 2021. He is now a PhD student in the Department of Computer Science at NUS. His research interests are in the areas of information theory, statistical learning, and theoretical computer science.
Jonathan Scarlett (S’14 – M’15) received the B.Eng. degree in electrical engineering and the B.Sci. degree in computer science from the University of Melbourne, Australia. From October 2011 to August 2014, he was a Ph.D. student in the Signal Processing and Communications Group at the University of Cambridge, United Kingdom. From September 2014 to September 2017, he was post-doctoral researcher with the Laboratory for Information and Inference Systems at the École Polytechnique Fédérale de Lausanne, Switzerland. Since January 2018, he has been an assistant professor in the Department of Computer Science and Department of Mathematics, National University of Singapore. His research interests are in the areas of information theory, machine learning, signal processing, and high-dimensional statistics. He received the Singapore National Research Foundation (NRF) fellowship, and the NUS Early Career Research Award.