Optimal Rates of Teaching and
Learning Under Uncertainty
Abstract
In this paper, we consider a recently-proposed model of teaching and learning under uncertainty, in which a teacher receives independent observations of a single bit corrupted by binary symmetric noise, and sequentially transmits to a student through another binary symmetric channel based on the bits observed so far. After a given number of transmissions, the student outputs an estimate of the unknown bit, and we are interested in the exponential decay rate of the error probability as the number of transmissions increases. We propose a novel block-structured teaching strategy in which the teacher encodes the number of 1s received in each block, and show that the resulting error exponent is the binary relative entropy between 1/2 and the larger of the two noise parameters. This matches a trivial converse result based on the data processing inequality, and settles two conjectures of [Jog and Loh, 2021] and [Huleihel, Polyanskiy, and Shayevitz, 2019]. In addition, we show that the computation time required by the teacher and student is linear in the number of transmissions. We also study a more general setting in which the binary symmetric channels are replaced by general binary-input discrete memoryless channels. We provide an achievability bound and a converse bound, and show that the two coincide in certain cases, including (i) when the two channels are identical, and (ii) when the teacher-student channel is a binary symmetric channel. More generally, we give sufficient conditions under which our learning rate is the best possible for block-structured protocols.
I Introduction
In several societal and technological domains, one is interested in how agents interact with their environment and with each other to attain goals such as learning information about the environment, conveying this information to other agents, and reaching a common consensus. While a comprehensive theoretical understanding of such problems would likely require highly sophisticated mathematical models, even the simplest models come with unique insights and challenges.
In this paper, we study a recently-proposed model of teaching and learning under uncertainty [1, 2], in which a teacher observes noisy information regarding an unknown 1-bit quantity (the environment), and seeks to convey information to a student via a noisy channel in order to facilitate the student learning this quantity. We establish the optimal learning rate (i.e., the exponential decay rate of the error probability) for this problem, thereby settling two notable conjectures made in [1, 2] that are described in detail below. In addition, we study a generalization of the problem to more general binary-input discrete memoryless channels, and obtain the optimal learning rate in certain special cases of interest.
I-A Model and Definitions

We first formalize the model, which is depicted in Figure 1. The unknown quantity of interest is a binary random variable taking either of its two values with probability 1/2 each.
At each time step, the teacher observes the unknown bit through a binary symmetric channel (BSC) with a given error rate:
(1) |
The binary symmetric channel is memoryless, meaning that the teacher's observations are conditionally independent given the unknown bit. In Section III, we will study more general and possibly non-symmetric binary-input discrete memoryless channels.
At each time step, the teacher then transmits a single bit to the student through another binary symmetric channel with its own error rate. Thus, denoting the student's observations accordingly, we have
(2) |
and the student's observations are conditionally independent given the corresponding transmitted bits. Importantly, each transmitted bit must only be a function of the teacher's observations up to the current time; the teacher cannot look into the future.
At the final time step, having received all of its observations, the student makes an estimate of the unknown bit. Then, the learning rate is defined as
(3) |
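As a point of reference, one standard formalization of such a rate is shown below; the symbols n for the number of transmissions and the hat for the student's estimate are labels chosen here for illustration (they are not the original notation), and whether a limit or a limit inferior is used is a matter of convention.

\[
\text{learning rate} \;=\; \liminf_{n\to\infty} \; -\frac{1}{n}\,\log \Pr\big[\hat{\Theta}_n \neq \Theta\big].
\]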
We assume that the noise parameters of both channels are known to both the teacher and student; removing this assumption may be an interesting direction for future work.
I-B Existing Results
The preceding motivation and setup follows that of Jog and Loh [1], and the same problem was also considered with different motivation and terminology by Huleihel, Polyanskiy, and Shayevitz [2], who framed the problem using the terminology of relay channels. While relay channels are already a central topic in information theory, the authors of [2] further motivated their study via the more recent information velocity problem, which was posed by Yury Polyanskiy [2] and is also captured under a general framework studied by Rajagopalan and Schulman [3]. In that problem, the goal is to reliably transmit a single bit over a long chain of relays while maintaining a non-vanishing ratio between the number of hops and the total transmission time. Below we will discuss a conjecture (for the 2-hop setting) made in [2] that is directly inspired by the connection with information velocity.
It is known that the student’s optimal strategy is the maximum likelihood decoder (assuming the teacher’s strategy is known) [2, 1]. In [1], it was discussed that optimal strategies can be difficult to analyze in general, and accordingly, the following two simpler student strategies were considered:
•
Majority rule: The estimate is chosen according to the majority of the student's observations.
•
A truncated variant of the majority rule: The estimate is the majority of only the most recent portion of the observations (for a suitably-chosen fraction of the total time), with the rough idea being to allow time for the teacher to learn first.
These student strategies were analyzed together with the following teacher strategies:
•
Simple forwarding: The teacher directly forwards its most recent observation at each time step.
•
Cumulative teaching: The teacher transmits its best estimate of the unknown bit at each time step, i.e., the majority value among all of the observations received so far. (Both baselines are illustrated in the simulation sketch following this list.)
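As a concrete illustration of these two baseline teachers, paired with the plain majority-rule student, the following minimal Python sketch estimates their error probabilities by Monte Carlo simulation. The parameter names p, q and the specific numerical values are illustrative choices made here, not taken from [1] or [2].

import numpy as np

rng = np.random.default_rng(0)

def bsc(bits, eps):
    """Pass a 0/1 integer array through a binary symmetric channel with crossover eps."""
    return bits ^ (rng.random(bits.shape) < eps)

def error_rate(n, p, q, teacher, trials=20000):
    """Monte Carlo error probability of a baseline teacher with a majority-rule student."""
    errors = 0
    for _ in range(trials):
        theta = rng.integers(2)
        x = bsc(np.full(n, theta), p)                 # teacher's noisy observations
        if teacher == "forward":                      # simple forwarding
            sent = x
        else:                                         # cumulative teaching: running majority
            sent = (2 * np.cumsum(x) > np.arange(1, n + 1)).astype(int)
        y = bsc(sent, q)                              # student's noisy observations
        theta_hat = int(2 * y.sum() > n)              # majority rule (ties resolved as 0)
        errors += int(theta_hat != theta)
    return errors / trials

print(error_rate(200, 0.2, 0.2, "forward"),
      error_rate(200, 0.2, 0.2, "cumulative"))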
It was demonstrated in [1] that none of these strategies uniformly outperforms the others, and that any combination of them falls short of the upper bound (i.e., converse) based on data processing arguments (here and subsequently, the binary relative entropy refers to the relative entropy between Bernoulli distributions with the corresponding parameters). The optimal learning rate was left as an open problem, and it was conjectured that at least part of the weakness is in the upper bound, i.e., that the rate given by the data processing bound cannot be attained.
In [2], both simple forwarding and cumulative teaching (i.e., relaying) were considered, along with a block-structured variant that only updates the majority once per block of time steps. These strategies were analyzed alongside the optimal maximum-likelihood learning (i.e., decoding) rule. While the learning rates attained again fall short of the upper bound, it was shown that one comes within a constant factor of it in the high-noise setting (both noise parameters approaching 1/2). It was conjectured in [2] that, using a more sophisticated protocol, this constant could be made arbitrarily close to one in the high-noise limit. As hinted above, the motivation for this conjecture was the fact that positive information velocity is known to be attainable; the authors of [2] discuss that for this to be possible, the 1-hop and 2-hop exponents should match in the high-noise limit so that “information propagation does not slow down”.
II Optimal Rate Under Binary Symmetric Noise
In this section, we focus on the above-described setup in which both channels in the system are binary symmetric channels. We turn to more general binary-input discrete memoryless channels in Section III.
II-A Statement of Optimal Learning Rate
Our first main result is stated as follows.
Theorem 1.
Under the preceding setup, for any noise parameters, let the optimal learning rate be the supremum of learning rates across all teaching and learning protocols. Then
(4) |
where d(·∥·) denotes the binary relative entropy function (see the definition displayed below).
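For concreteness, the binary relative entropy of a with respect to b is given below, together with the special case that arises repeatedly in this paper; natural logarithms are used here purely for definiteness (the base only rescales the exponent).

\[
d(a\,\|\,b) \;=\; a\log\frac{a}{b} + (1-a)\log\frac{1-a}{1-b},
\qquad
d\Big(\tfrac12\,\Big\|\,\delta\Big) \;=\; -\log\!\Big(2\sqrt{\delta(1-\delta)}\Big).
\]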
We achieve this result via a novel block-structured teaching strategy. In contrast with [2], the teacher transmits bits based on a single block at a time, ignoring earlier blocks. The teacher encodes the number of 1s received in the block by sending a sorted sequence (i.e., a string of 1s followed by a string of 0s), with the transition point determined by a carefully-chosen non-linear function of the number of 1s observed. The student performs maximum-likelihood decoding, and we study the error probability via the Bhattacharyya coefficient. The details are given in Sections II-B and II-C, and in Section II-D we show that the overall computation time required by the teacher and student is linear in the total number of transmissions.
Comparison to upper bound (converse). In the 1-hop setting where the unknown bit is observed directly through a single binary symmetric channel, the optimal learning rate is achieved by the majority decoder, namely [1, 2]
(5) |
Through data processing inequalities, the 2-hop learning rate cannot exceed the 1-hop learning rate associated with the teacher's observation channel, and an analogous argument applies to the teacher-student channel. We therefore obtain the upper bound
(6) |
Theorem 1 shows that this bound is tight, disproving the above-mentioned conjecture of [1], and both confirming and strengthening the conjecture of [2].
In the remainder of the section, we only consider the case in which the two noise parameters are equal. This is without loss of generality, since in the general case, one can simulate additional noise on the less noisy channel to make both noise parameters equal to the larger of the two.
II-B Teacher and Student Protocol
Before describing the protocol, we first need to introduce some statistical tools.
II-B1 Statistical Tools and Student Decoding Rule
Let two discrete random variables be defined on a common finite alphabet. Their Bhattacharyya coefficient is a real number in [0, 1] given by
(7) |
Overloading notation, we will sometimes also write this quantity with the two associated probability mass functions as its arguments.
The Bhattacharyya coefficient measures the ‘closeness’ between the two distributions. Values close to 0 indicate easily separated distributions, while values close to 1 indicate very similar distributions. In particular, the coefficient equals 0 if and only if the two distributions have disjoint supports, and equals 1 if and only if they are identical.
We will make use of some standard properties of the Bhattacharyya coefficient:
•
Suppose that we have a random variable which is known to follow one of two known distributions. If we draw one instance and use it to decide which distribution it came from, then there exists a strategy achieving an error probability of at most the Bhattacharyya coefficient between the two distributions. To achieve this, we simply use the maximum likelihood test: given the observed value, we decide in favor of whichever of the two distributions assigns it the higher probability. The probability of selecting the first distribution when the second is in fact the true one is bounded as
(8) The other error type is upper bounded similarly, and combining the two error types yields the desired bound:
(9)
• For independent pairs of random variables, the Bhattacharyya coefficient between the joint distributions is the product of the individual coefficients:
(10)
• In particular, for vectors with i.i.d. entries, the Bhattacharyya coefficient equals the single-letter coefficient raised to the power of the vector length:
(11)
In accordance with (9), we assume throughout the paper that the student adopts the optimal maximum-likelihood decoding rule, where the two underlying distributions are the conditional distributions of the student's full observation sequence given each value of the unknown bit.
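The following minimal Python sketch illustrates the quantities above on small hand-picked PMFs: it computes the Bhattacharyya coefficient, checks the maximum-likelihood error bound numerically (with equal priors, the exact error probability is half the sum of the pointwise minima, which is upper bounded by the coefficient, and indeed by half of it), and verifies the product property behind (10)–(11). The PMFs themselves are arbitrary illustrative choices.

import numpy as np

def bhattacharyya(P, Q):
    """Bhattacharyya coefficient between two PMFs on a common finite alphabet."""
    return np.sum(np.sqrt(P * Q))

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.2, 0.3, 0.5])

# ML test with equal priors: exact error probability is 0.5 * sum_x min(P(x), Q(x)).
ml_error = 0.5 * np.sum(np.minimum(P, Q))
assert ml_error <= bhattacharyya(P, Q)

# Product property: for independent pairs the coefficient factorizes, so for
# n i.i.d. copies it equals the single-letter coefficient to the n-th power.
P2 = np.outer(P, P).ravel()   # PMF of two i.i.d. draws from P
Q2 = np.outer(Q, Q).ravel()   # PMF of two i.i.d. draws from Q
assert np.isclose(bhattacharyya(P2, Q2), bhattacharyya(P, Q) ** 2)

print(bhattacharyya(P, Q), ml_error)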
II-B2 Block-Structured Design and Teaching Strategy
The transmission protocol is broken down into consecutive blocks of equal length. Within each block, the bits transmitted by the teacher are a (deterministic) function of the observations received in the previous block, so the very first transmitted block carries no information and is ignored by the student. The function used to generate the transmitted block from the preceding received block is given in Section II-B3.
Under this design, each block received by the student only depends on the corresponding (previous) block observed by the teacher; see Figure 2 for an illustration. To simplify notation, we define
(12) |
to be the corresponding block received by the student.
The memoryless properties of the channels imply that the received blocks are conditionally independent given the unknown bit. Since the function that generates each transmitted block from the corresponding observed block is the same every time, their conditional distributions are also identical.
The student thus needs to distinguish between two possible i.i.d. sequences of blocks, one for each value of the unknown bit. Using (9) and (11), we have
(13) |
By setting the block length to be a fixed constant as the total transmission time increases, it follows that the learning rate is lower bounded by
(14) | ||||
(15) |
where the per-block Bhattacharyya coefficient implicitly depends on the block length. In fact, while a constant block length suffices for our purposes, a similar argument holds when the block length grows sub-linearly in the total transmission time, in which case a suitable limiting operation should also be included on the right-hand side of (15).
Remark 1.
This block structure bears a high-level similarity to block-Markov coding in the analysis of relay channels [8, Sec. 16.4]. However, the details here are very different from existing analyses that use joint typicality and related notions. The error exponent associated with such analyses would be low due to the effective block length being held constant, whereas we maintain a high error exponent via joint decoding over the entire received sequence.
II-B3 Protocol within Each Block
In each block, the teacher receives a block of noisy bits (e.g., the first block of its observations). We consider the fraction of 1s among these bits (i.e., the number of 1s divided by the block length).
The teacher sends a run of 1s followed by a run of 0s (with a delay of one block), where the length of the run of 1s is the block length multiplied by a function of the observed fraction, defined as
(16) |
A sample plot of this function is given in Figure 3. As exemplified in the figure, the function is non-decreasing, and
(17) | |||
(18) |
where (18) is trivial for , and follows easily from the convexity of for . Specifically, this convexity implies that the derivative of increases for , whereas the chain rule applied to (17) implies that the derivative of decreases for . Since (18) holds with equality for by definition, this behavior of the derivatives implies that (18) also holds for .
In general, the prescribed run length may not be an integer. When this happens, we round it and send the corresponding integer numbers of 1s and 0s. The rounding does not affect the error analysis, so to simplify notation, we focus on the integer-valued case; see the remark accompanying (20) below for an additional comment on this issue.
Before proceeding with the student strategy and error analysis, we intuitively explain why sending a sorted sequence (i.e., with all the 1s sent before the 0s) can make decoding easier for the student. Suppose the student receives the string 111100000010. If the student knows in advance that the string sent by the teacher is sorted, they can be confident that the lone 1 near the end of the string is a flipped bit. However, if this were an unsorted string, such corrections could not be made.
The reasoning behind our choice of the teacher's function will become apparent in the error analysis in the following subsection, but intuitively, it serves as a carefully-chosen middle ground between the two simpler choices of the identity map (bearing some resemblance to simple forwarding) and a hard threshold at 1/2 (bearing some resemblance to cumulative teaching).
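The following sketch shows the shape of the teacher's per-block rule in code. The specific non-linear map in (16) is not reproduced here, so it is passed in as an argument f (any non-decreasing map from [0,1] to [0,1] fits this interface); only the "sorted run of 1s then 0s" structure and the rounding convention are what this sketch is meant to illustrate.

def encode_block(observed_bits, f):
    """Teacher's transmission for one block (sent with a one-block delay).

    observed_bits : list of 0/1 values received by the teacher in the previous block.
    f             : non-decreasing map from [0, 1] to [0, 1]; a stand-in for the
                    carefully-chosen function in (16), which is not reproduced here.
    """
    k = len(observed_bits)
    frac_ones = sum(observed_bits) / k            # fraction of 1s in the block
    num_ones = round(k * f(frac_ones))            # transition point (rounded to an integer)
    return [1] * num_ones + [0] * (k - num_ones)  # sorted string: 1s then 0s

# illustrative usage with a placeholder map (the identity), not the map from (16)
print(encode_block([1, 0, 1, 1, 0, 1, 0, 1], lambda g: g))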
II-C Error Analysis
Continuing the study of a single block defined according to (12), observe that we have the Markov chain
(19) |
and in accordance with (15), we would like the two conditional distributions of the received block to be as ‘far apart’ as possible. We break the error analysis into two parts, analyzing the two links of the above Markov chain separately.
II-C1 Relating the Received Block to the Teacher's Observed Fraction
Define an auxiliary random variable to follow the conditional distribution of the received block given the fraction of 1s observed by the teacher. This conditional distribution is simple; recalling that we are focusing on the case of equal noise parameters, we have the following:
•
The bits of the received block are independently distributed;
• For indices up to the transition point prescribed by the teacher's function, each bit is 1 with probability equal to one minus the noise parameter, and 0 otherwise;
• For indices beyond the transition point, each bit is 1 with probability equal to the noise parameter, and 0 otherwise.
We proceed by upper bounding , starting with the case that .
When comparing the conditional distributions associated with two different observed fractions, the bits before the smaller of the two transition points, and the bits after the larger one, are identically distributed and thus do not provide any distinguishing power. We are left with the bits in between, which are i.i.d. Bernoulli with parameter equal to either the noise parameter or its complement.
The Bhattacharyya coefficient between these two Bernoulli distributions is easily computed to be twice the square root of the product of the noise parameter and its complement, and using (11), we obtain the following. (Remark: If we were to explicitly incorporate rounding as discussed following (18), then this bound would be replaced by a slightly looser one in which the exponent is reduced by one; this only amounts to multiplying the entire expression by a constant, which does not impact the resulting exponential decay rate.)
(20) |
To simplify the exponent, we use (17) and (18) to obtain
(21) | |||
(22) |
and hence,
(23) |
We can therefore weaken (20) to
(24) |
Although an ordering assumption on the two underlying fractions was used in the derivation of (20), a trivial argument shows that under the reverse ordering, the right-hand side of (20) is at least one (since the teacher's function is non-decreasing). Since the Bhattacharyya coefficient never exceeds one, it follows that (24) remains true in that case as well.
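For reference, writing δ for the common noise parameter and Z(·, ·) as a stand-in for the Bhattacharyya coefficient notation (these symbols are our own labels), the per-bit quantity used in the derivation above is

\[
Z\big(\mathrm{Bern}(\delta),\,\mathrm{Bern}(1-\delta)\big)
\;=\; \sqrt{\delta(1-\delta)} + \sqrt{(1-\delta)\delta}
\;=\; 2\sqrt{\delta(1-\delta)},
\qquad
-\log\!\Big(2\sqrt{\delta(1-\delta)}\Big) \;=\; d\Big(\tfrac12\,\Big\|\,\delta\Big).
\]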
II-C2 Relating the Received Block to the Unknown Bit
The next step is to decompose the conditional distribution of the received block given the unknown bit into the simpler conditional distributions given the teacher's observed fraction. To do so, we use the following definition.
Definition 1.
Let a finite collection of probability distributions be defined on a common finite probability space, along with the same number of non-negative real numbers summing to one. The mixture distribution described by these weights and distributions is defined as the corresponding convex combination.
We proceed with a simple technical lemma on the Bhattacharyya coefficient of mixtures.
Lemma 1.
Suppose that follows a mixture distribution described by . Then for any distribution , we have
(25) |
Proof.
Using the inequality stating that the square root of a sum is at most the sum of the square roots, we have
(26) | ||||
(27) | ||||
(28) |
Simple re-arranging gives the desired result. ∎
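As a quick numerical sanity check, the following sketch tests the inequality implied by the proof step above, namely that the Bhattacharyya coefficient between a mixture sum_i lambda_i P_i and any Q is at most sum_i sqrt(lambda_i) times the coefficient between P_i and Q. This is one natural reading of Lemma 1, reconstructed here from the square-root step in the proof rather than copied from the original statement.

import numpy as np

rng = np.random.default_rng(1)

def bhattacharyya(P, Q):
    return np.sum(np.sqrt(P * Q))

def random_pmf(m):
    p = rng.random(m)
    return p / p.sum()

for _ in range(1000):
    P1, P2, Q = random_pmf(5), random_pmf(5), random_pmf(5)
    lam = rng.random()
    mixture = lam * P1 + (1 - lam) * P2
    lhs = bhattacharyya(mixture, Q)
    rhs = np.sqrt(lam) * bhattacharyya(P1, Q) + np.sqrt(1 - lam) * bhattacharyya(P2, Q)
    assert lhs <= rhs + 1e-12   # mixture bound holds in every random trial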
II-D Computational Complexity
In this section, we argue that it is possible to execute the described protocol with an overall computation time that is linear in the total number of transmissions (i.e., no higher than that of simply reading the received bits). This is immediate for the teacher, so we focus our attention on the student.
To implement the maximum likelihood decoder, it suffices to compute the associated log likelihood ratio (LLR) given the received bits. In addition, since the blocks are independent by construction and memorylessness, the overall LLR is the sum of per-block LLRs. Hence, we only need to show that each of these can be computed in time linear in the block length.
The only minor difficulty in computing the LLR for a single block is that we need to account for all possible choices of the latent/hidden variable, namely, the number of 1s observed by the teacher in the block: Letting a generic received block be defined according to (12), we have
(37) |
and similarly for the opposite value of the unknown bit. In addition, since the channel is memoryless, we have
(38) |
where the product is over the individual bits comprising the block. Naively, we could evaluate (37) and (38) directly, but this would give a per-block complexity that is quadratic in the block length.
To improve on this, we recall that for two adjacent values of the latent variable, the associated transmitted strings only differ in the positions between the two corresponding transition points. Therefore, we can initially compute the conditional likelihood for the smallest value of the latent variable, and then update it for each successive value by only examining the positions affected by the increment. In this manner, it only takes time linear in the block length to compute all of the conditional likelihoods. In addition, the latent variable in (37) follows a (scaled) binomial distribution, whose probability mass function can be pre-computed in linear time (assuming constant-time arithmetic operations). Hence, we obtain an overall per-block computation time that is linear in the block length.
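A minimal sketch of this per-block computation is given below, assuming equal noise parameters as in Section II. The map m (the number of leading 1s the teacher transmits for each possible count of observed 1s) is passed in as a non-decreasing integer array standing in for the rounded version of (16), which is an interface assumption rather than the paper's exact definition; as described above, the overall decoder simply sums these per-block LLRs across blocks and outputs the more likely value of the bit.

import numpy as np
from scipy.stats import binom
from scipy.special import logsumexp

def block_llr(y, delta, m):
    """LLR of one received block in O(k) time, where k is the block length.

    y     : 0/1 numpy array of length k received by the student (through BSC(delta)).
    delta : common crossover probability of the two channels.
    m     : non-decreasing integer array of length k+1; m[s] is the number of leading
            1s transmitted when the teacher observed s ones in its previous block.
    """
    k = len(y)
    log1 = np.where(y == 1, np.log(1 - delta), np.log(delta))  # per-bit log-lik. if a 1 was sent
    log0 = np.where(y == 1, np.log(delta), np.log(1 - delta))  # per-bit log-lik. if a 0 was sent
    # log P(y | s) for all s, computed incrementally: raising s only switches the
    # transmitted bits in positions m[s], ..., m[s+1]-1 from 0 to 1.
    ll = np.empty(k + 1)
    ll[0] = log1[: m[0]].sum() + log0[m[0]:].sum()
    for s in range(k):
        ll[s + 1] = ll[s] + (log1[m[s]: m[s + 1]] - log0[m[s]: m[s + 1]]).sum()
    # Number of 1s observed by the teacher: Binomial(k, 1-delta) if the bit is 1,
    # Binomial(k, delta) if the bit is 0 (this prior can be pre-computed once).
    s_vals = np.arange(k + 1)
    return (logsumexp(binom.logpmf(s_vals, k, 1 - delta) + ll)
            - logsumexp(binom.logpmf(s_vals, k, delta) + ll))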
In addition to the low computational overhead, another advantage of our strategy is that it is anytime. Specifically, it can be run without knowledge of the total transmission time, and we can stop the algorithm at any time and ask the student for its current estimate. The resulting error probability still decays with the same exponent in this scenario, provided that the block size is held fixed and the total time grows large.
III General Binary-Input Discrete Memoryless Channels
In this section, we consider the same model as Section I-A, except that we replace the two binary symmetric channels by general discrete memoryless channels (DMCs) that are known to the student and teacher, both of which have binary inputs. We denote their respective transition probabilities accordingly. The case in which the teacher-student channel has a non-binary input alphabet is discussed in Section IV.
We first overview the 1-hop case in Section III-A, following the classical work of Shannon, Gallager, and Berlekamp [10]. We then state an achievable learning rate in Section III-B, and give the protocol and analysis in Section III-C, which are essentially more technical variants of the binary symmetric case. We provide an upper bound on the learning rate (i.e., a converse) in Section III-D, and use it to deduce the optimal learning rate in certain special cases. While the tightness of our achievable learning rate remains open in general, we additionally show in Section III-E that, at least under certain technical assumptions, it cannot be improved for block-structured protocols.
III-A Existing Results on the 1-Hop Case
We first consider a 1-hop scenario in which a single agent makes repeated observations of the unknown variable through repeated uses of a binary-input DMC . This is a well-studied problem, and we will summarize some of the results from [10].
We begin with a generalization of the Bhattacharyya coefficient. For a real-valued parameter between 0 and 1, and two random variables defined on the same finite alphabet, let
(39) |
where the upper bound of one follows by applying Jensen’s inequality to the concave power function. This function can be extended to the endpoint parameter values by continuity. We also note that setting the parameter to 1/2 recovers the Bhattacharyya coefficient, and we write the two notational forms (in terms of random variables or their distributions) interchangeably. Similarly to (10)–(11), we have the following properties for independent pairs and i.i.d. vectors:
(40) | |||
(41) |
Given a binary-input DMC with its two inputs, we define
(42) |
and denote its logarithm as follows (see Figure 4 for some examples of how this quantity varies as a function of its parameter):
(43) |
where the two conditional output distributions are treated as probability distributions on the output alphabet. The first and second derivatives of this logarithmic quantity with respect to its parameter are denoted with primes, and once again, the values at the endpoints are defined with respect to the appropriate limits.
Suppose that the agent makes a single observation of the unknown bit through the channel, and uses it to estimate the bit. A standard argument shows that, for any parameter value, the quantity in (42) upper bounds the probability of a maximum-likelihood decision rule selecting one hypothesis when the other is true; an analogous property holds for the reverse error event with the parameter replaced by its complement, and it follows that the same quantity upper bounds the overall error probability for all parameter values.
In view of the property in (41), if the agent instead makes multiple independent observations, the upper bound becomes the single-observation quantity raised to the power of the number of observations, implying a learning rate of at least
(44) |
The following well-known result reveals that this learning rate is optimal upon optimizing over the parameter.
Lemma 2.
[10, Corollary of Thm. 5] (This is a weakened version of the statement in [10], in which we simplify the pre-exponential coefficient, and specialize to the case that every transmitted symbol is identical in order to match our setup.) Let the parameter be chosen to minimize the logarithmic quantity in (43). Then any decoding strategy of the agent after a given number of channel uses must have an error probability lower bounded by an exponential whose exponent is the optimized value multiplied by the number of uses, up to an additive sub-linear correction, when the unknown bit is equiprobable.
Since the correction term is sub-linear in the number of channel uses, this immediately gives the following corollary.
Corollary 1.
The optimal learning rate for the 1-hop system is given by the negative of the minimum of the logarithmic quantity in (43) over its parameter range.
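The quantities above are easy to evaluate numerically. The following sketch computes the logarithmic quantity and the resulting 1-hop exponent for an arbitrary binary-input DMC given as a 2-by-|Y| row-stochastic matrix, and checks it against the closed-form binary-symmetric value from Section II; the matrix and parameter value are illustrative choices.

import numpy as np
from scipy.optimize import minimize_scalar

def one_hop_exponent(W):
    """-min_{s in [0,1]} log sum_y W[0,y]^(1-s) * W[1,y]^s for a binary-input DMC W."""
    def mu(s):
        return np.log(np.sum(W[0] ** (1.0 - s) * W[1] ** s))
    return -minimize_scalar(mu, bounds=(0.0, 1.0), method="bounded").fun

delta = 0.1
bsc = np.array([[1 - delta, delta],
                [delta, 1 - delta]])
# For the BSC, the optimum is at s = 1/2 and equals -log(2*sqrt(delta*(1-delta))).
print(one_hop_exponent(bsc), -np.log(2 * np.sqrt(delta * (1 - delta))))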
III-B Achievability Result
We now state the second main result of our paper, giving an achievability result for general binary-input DMCs.
Theorem 2.
Let the two channels be arbitrary binary-input discrete memoryless channels, and let the optimal learning rate be the supremum of learning rates across all teaching and learning protocols. Then
(45) |
Moreover, this can be improved to
(46) |
While (46) clearly implies (45), we find the expression (45) easier to work with and prove. Once (45) is proved, (46) follows from the fact that one can choose to flip the inputs of the teacher-student channel, which amounts to relabeling the corresponding quantities. (At first this leads to a seemingly different learning rate, but (46) can be recovered by moving the input-flipping operation outside the outer minimum — swapping the minimization order has no effect — and applying an elementary identity for minima and maxima, verified by checking all six orderings of the three quantities involved.)
When the two channels are identical, (45) reduces to the expression in Corollary 1. Since the 2-hop learning rate cannot exceed that of the 1-hop case (due to a data-processing argument similar to the binary symmetric setting), we conclude that Theorem 2 provides the optimal learning rate in this case. Other scenarios in which Theorem 2 gives the optimal learning rate will be discussed in Section III-D.
III-C Protocol and Analysis for the Achievability Result
Fix an arbitrary value of the parameter and define
(47) |
To prove (45), it suffices to show that the quantity in (47) is an achievable learning rate. We assume without loss of generality that it is positive, since a learning rate of zero is trivially achieved.
We again adopt the block structure used in Section II, with a fixed block size. Suppose that in a single block, the teacher receives its corresponding observations. We define the log-likelihood ratio (LLR) of this block as
(48) |
and let the two conditional distributions of this LLR be indexed by the two possible values of the unknown bit.
Upon receiving its block, the teacher computes the LLR, and sends a run of 1s followed by a run of 0s, with the length of the run of 1s determined by a function of the LLR defined as follows:
(49) |
Since , , and , we have
(50) |
and similarly,
(51) |
which implies that the function takes values in the valid range. It may be necessary to round the prescribed run length to the nearest integer, but similarly to Section II, this does not affect the result, and the rounding is omitted from our analysis.
The main difference in the analysis compared to Section II is establishing the following technical lemma.
Lemma 3.
Under the preceding setup and definitions, we have for all that
(52) |
Proof.
We claim that this result implies
(53) |
To see this, we check the possible cases in the definition of the teacher's function in (49). In the first case, the left inequality in (53) is trivial, and the right inequality follows by considering two sub-cases depending on which term attains the minimum in (49): in one sub-case we apply (51), and in the other we apply (52). The remaining case is handled similarly.
From here, the analysis is similar to that of Section II. Observe that we have a Markov chain passing through the teacher's LLR, where the final element is the block of symbols received by the student. Since this received block takes on two different distributions depending on the value of the unknown bit, we may also view it as the output of a ‘single-use’ discrete memoryless channel whose input is the unknown bit and whose output has the length of one block. With a slight abuse of notation, we define the quantities in (42)–(43) according to this channel.
To upper bound , we use the following straightforward generalization of Lemma 1.
Lemma 4.
Suppose that follows a mixture distribution described by . Then for any distribution , we have
(54) |
and
(55) |
The proof of this lemma is identical to that of Lemma 1, except that the square root is replaced by the more general fractional power and its complement.
Decomposing each of the two conditional distributions of the received block as a mixture based on the (finitely many) possible values of the teacher's LLR, and applying Lemma 4 twice, we obtain
(56) | ||||
(57) |
To help simplify this expression, we use the following consequence of (53) (again recalling ):
(58) |
Suppose first that . Since is an increasing function and only the bits in between positions and differ, we have
(59) | ||||
(60) |
noting that each differing bit contributes to the total. Combining (58) and (60), the terms containing cancel, and we are left with
(61) |
On the other hand, if , then the increasing property of gives
(62) | |||
(63) |
and combining this with the fact that , we obtain
(64) |
By the definition of the LLR in (48), its value only depends on the number of occurrences of each output symbol in the block, and not on their specific order. Hence, with a finite output alphabet, the number of possible LLR values grows only polynomially in the block length [11, Ch. 2], and we conclude from (57), (61) and (64) that
(65) |
As with (15), this implies that the overall learning rate satisfies
(66) |
By setting the block length large enough, we can obtain any rate arbitrarily close to the quantity in (47), completing the proof of Theorem 2.
III-D A Converse Result and Its Consequences
As hinted following Theorem 2, by data processing arguments, we cannot obtain a learning rate greater than
(67) |
and this matches Theorem 2 when the two channels are identical. More generally, we can immediately deduce that Theorem 2 gives the optimal learning rate whenever one of the two channels is a degraded version of the other (i.e., there exists an auxiliary channel such that composing the less noisy channel with it gives an overall channel with the same transition law as the other); in either direction, the teacher (respectively, student) can artificially introduce the extra noise to reduce to the case in which the two channels coincide.
In this section, we give an additional upper bound on the learning rate (i.e., converse) that improves on the simple one in (67), and establishes the tightness of Theorem 2 in certain cases beyond those mentioned above. To state this converse, we introduce an additional definition. Momentarily departing from the teaching/learning setup, consider the following two-round 1-hop communication scenario defined in terms of the two DMCs and a time-splitting parameter:
•
The sender seeks to communicate one of two messages, each occurring with probability 1/2;
•
In the first round of communication, the sender transmits a fixed fraction of the total number of symbols via independent uses of the teacher's observation channel, without feedback;
•
After this first round, the sender observes the corresponding output symbols via a noiseless feedback link;
•
In the second round of communication, the sender transmits the remaining symbols via independent uses of the teacher-student channel, with no further feedback.
This is a somewhat unconventional setup with independent and non-identical channel uses, as well as partial feedback. We denote the best possible error exponent under this setup, as the total number of transmissions grows, with a subscript 2 representing its two-round nature.
Theorem 3.
Let the two channels be arbitrary binary-input discrete memoryless channels, and let the optimal learning rate be the supremum of learning rates across all teaching and learning protocols. Then, for any value of the time-splitting parameter, we have
(68) |
Proof.
Consider a “genie-aided” setup in which the teacher and student are given the following additional information:
(i)
The student can observe the input (rather than only the output) of the teacher-student channel during the first round of time steps.
(ii)
The teacher is given the true value of the unknown bit after the first round.
Since the teacher and student can always choose to ignore this additional information, any upper bound on the learning rate in this genie-aided setting is also an upper bound in the original setting.
The theorem follows by noting that this genie-aided setup can be viewed as an instance of the above two-round communication setup:
•
For the first round of time steps, once the student is given the teacher's channel inputs, no further information is provided by the corresponding (degraded) channel outputs, and the student's information about the unknown bit during this round is captured by repeatedly observing it through the teacher's observation channel without feedback.
•
Similarly, for the remaining time steps, once the teacher is given the true value of the unknown bit, no further information is provided by its own observations, but the value of the bit still needs to be conveyed to the student via the remaining uses of the teacher-student channel. Hence, the student's final observed symbols correspond to using that channel with no further feedback, but with the first-round outputs being available at the sender (teacher).
∎
In the remainder of this subsection, we connect the converse result of Theorem 3 with the achievability result of Theorem 2 in two different ways:
•
Under a certain technical assumption (Assumption 1 below), we show that there exists a choice of the time-splitting parameter such that the exponent in Theorem 2 equals the analogous two-round exponent defined in the same way as above, but with no feedback being available in the communication setup. Hence, the achievability and converse only differ with respect to the availability of feedback.
•
We identify cases, notably when the teacher-student channel is a binary symmetric channel, in which the achievable learning rate is exactly optimal, and use this to show that Theorem 3 can strictly improve on the trivial converse in (67).
Starting with the former, we consider the following technical assumption.

Assumption 1.
The DMCs and satisfy the property that for all , it holds that and .
This assumption states that parameter values on one side of 1/2 are uniformly “better” (the logarithmic quantity is more negative) than their counterparts flipped about 1/2. The opposite scenario can be handled similarly, or can more simply be viewed as reducing to Assumption 1 upon swapping the two channel inputs. Up to such swapping, we were unable to find any binary-input binary-output channel for which Assumption 1 fails, and even in the multiple-output case, we found it difficult to find counter-examples. As two concrete examples, the BSC trivially satisfies the assumption with equality, and the reverse Z-channel (i.e., the regular Z-channel with its inputs swapped) also satisfies it. See Figure 4 for further illustrative numerical examples with binary output.
Lemma 5.
For any binary-input DMCs and satisfying Assumption 1, we have
(69) |
where is defined in the same way as above, but with no feedback being available in the communication setup.
Proof.
See Appendix -B. ∎
Observe that (69) precisely equals the learning rate derived in Theorem 2, and that we would match the converse in Theorem 3 if we could additionally show that the exponents with and without feedback coincide. Unfortunately, even the single round of feedback can strictly increase the error exponent (see Appendix -C for an example), meaning that the optimal learning rate remains unclear in general. On the positive side, it turns out that the two exponents do coincide whenever the teacher-student channel is a BSC, and in fact, in this case, we can establish that Theorem 2 gives the optimal learning rate even when Assumption 1 is dropped. Formally, we have the following.
Lemma 6.
For any binary-input DMC serving as the teacher's observation channel, if the teacher-student channel is a BSC, then the achievable learning rate in (45) is tight, i.e., it equals the optimal learning rate.
Proof.
See Appendix -D. ∎
Using Lemma 6 as a starting point, it is straightforward to establish that Theorem 3 can indeed strictly improve on the trivial converse in (67).
Corollary 2.
There exist binary-input DMCs for the two channels such that the optimal learning rate is strictly smaller than the trivial converse in (67).
Proof.
This follows by letting the teacher's observation channel be a (reverse) Z-channel and the teacher-student channel be a BSC, with suitably-chosen parameters. See Appendix -E for the details. ∎
III-E A Converse for Block-Structured Protocols
While we have established several special cases in which Theorem 2 gives the optimal learning rate, it remains open as to whether this is true in general. As a final result, we complement our general converse (Theorem 3) with a restricted converse that only applies to block-structured protocols. Specifically, we say that a teaching strategy is block structured with a given block length if it follows the structure in Figure 2: the bits transmitted within each block are only allowed to depend on the bits received in the preceding block.
Theorem 4.
For any pair of channels satisfying Assumption 1, and any protocol whose teaching strategy is block structured (with any block length), the learning rate is at most the achievable learning rate established in Theorem 2.
Proof.
See Appendix -F. ∎
IV Conclusion and Discussion
We have established optimal learning rates for the problem of teaching and learning a single bit under binary symmetric noise, using a simple and computationally efficient block-structured strategy. We have also adapted this technique to general binary-input DMCs.
As discussed above, while the optimal learning rate for general binary-input DMCs has been established in several cases of interest, it remains unknown in general, leaving us with the following open problem.
Open Problem 1.
Does (46) give the optimal learning rate for arbitrary binary-input DMCs?
An important step towards answering this question would be removing Assumption 1 in the results that currently make use of it. Also regarding this assumption, we expect that it should hold true for all binary-input channels, and it would be of interest to establish whether this is indeed the case.
Another possible direction is to increase the input alphabet size of the teacher-student channel, removing the assumption of binary inputs. In the 1-hop case, having a channel with more inputs does not improve the learning rate beyond choosing the two best inputs for the channel [10]. By a similar idea, we can immediately deduce an achievability result via Theorem 2, but we are left with the following open problem.
Open Problem 2.
Does there exist a pair of DMCs such that every restriction of the teacher-student channel to two inputs gives a strictly suboptimal learning rate?
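To make the two-best-inputs reduction mentioned above concrete in the 1-hop case, the following sketch restricts a multi-input DMC to each input pair, evaluates the corresponding pairwise exponent, and reports the best pair; the 3-input channel shown is a purely hypothetical example.

import itertools
import numpy as np
from scipy.optimize import minimize_scalar

def pairwise_exponent(W, a, b):
    """1-hop error exponent when only inputs a and b of channel W are used."""
    def mu(s):
        return np.log(np.sum(W[a] ** (1.0 - s) * W[b] ** s))
    return -minimize_scalar(mu, bounds=(0.0, 1.0), method="bounded").fun

def best_two_inputs(W):
    """Input pair maximizing the 1-hop exponent, per the reduction discussed above."""
    pairs = itertools.combinations(range(W.shape[0]), 2)
    return max(pairs, key=lambda ab: pairwise_exponent(W, *ab))

W = np.array([[0.90, 0.05, 0.05],
              [0.10, 0.80, 0.10],
              [0.05, 0.05, 0.90]])
print(best_two_inputs(W), pairwise_exponent(W, *best_two_inputs(W)))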
In addition to these open problems, interesting directions may include the case that the channel transition laws are unknown to the teacher and/or student, and the case of teaching more than a single bit.
-A Proof of Lemma 3 (Technical Result for General Binary-Input DMCs)
We first state an auxiliary result from [10]. When the agent seeks to decide between the alternatives and , the natural decision rule is to set a threshold on the corresponding log-likelihood ratio. The following result bounds the probability that this log-likelihood ratio test fails under a given threshold.
Lemma 7.
[10, Proof of Thm. 5] Let be a binary-input DMC, and for each output symbol , define
(70) |
to be the log-likelihood ratio between the two conditional distributions. Let and follow the distribution of under and respectively. For all , we have
(71) |
and
(72) |
We now proceed with the proof of Lemma 3. First suppose that satisfies
(73) |
Then, since is continuous and non-decreasing [10], there exists such that , and by (71) and (72) (applied to the -fold product distribution), we obtain
(74) |
and
(75) |
On the other hand, if
(76) |
then we may set , and (74) still holds due to the fact that . In this case, we have
(77) |
where (a) follows from the fact that (75) holds when . It follows that (75) also holds here with . By a similar argument, if , then we may set , and (74) and (75) remain valid. Therefore, for every value of (possibly ), there exists some such that equations (74) and (75) hold.
Since is convex (see [10, Thm. 5]), is lower bounded by its tangent line approximation at . This implies
(78) |
which re-arranges to give
(79) |
Multiplying through by and dividing through by , we obtain
(80) |
Finally, further upper bounding the left-hand side via (74) and further lower bounding the right-hand side via (75), we deduce the desired inequality in (52).
-B Proof of Lemma 5 (1-Hop Exponent Without Feedback)
Let and be the two codewords in the 1-hop communication problem without feedback, let denote the received sequence, and define the -letter version of as
(81) |
Using the classical analysis of Shannon, Gallager, and Berlekamp (see also Lemma 2 regarding a special case), the error exponent attained by optimal (maximum-likelihood) decoding is [10, Thm. 5] (the sub-exponential term in [10, Thm. 5] grows only sub-linearly in the number of channel uses due to the additive property of the logarithmic quantity for product distributions, and thus does not affect the error exponent)
(82) |
and accordingly, we proceed by analyzing .
Without loss of optimality, we can assume that and differ in every entry. Let be the fraction of uses of for which takes value (and hence takes value ), and define analogously. Then, the additive property of for product distributions (by taking the logarithm in (40)) gives
(83) |
noting that by the definition of , swapping the two inputs amounts to replacing by .
To further simplify (83), we define
(84) | |||
(85) |
and introduce the shorthand
(86) |
We first provide a lemma lower bounding , and then show that is itself a lower bound on .
Lemma 8.
There exists a choice of such that for all , it holds that .
Proof.
We split the proof into several cases. Throughout, it should be kept in mind that is a continuous and convex function [10].
Case 1: If and attains its global minimum (here “global minimum” is meant with respect to the restricted domain) at , then we can set to get
(87) |
A similar argument holds when and attains its global minimum at (setting ).
Case 2: If and does not attain its global minimum at , then we can shift towards ’s minimizer to decrease , contradicting the definition of . Hence, this case does not occur. A similar finding applies to the case that and does not attain its global minimum at
Case 3: We claim that the above cases only leave the case that and neither nor attain their global minimum at . To see this, note that if , then the larger one attaining its global minimum would give Case 1, whereas the larger one not attaining its global minimum would give Case 2. Hence, we may assume that . Then, we can additionally assume that neither nor attain their global minimum at , since otherwise we would again be in Case 1. We proceed with two further sub-cases.
Case 3a: If , then we can shift by a small amount and decrease , contradicting the fact that is the minimizer (unless we are at an endpoint of the interval , in which case one of and must be a global minimum, again a contradiction). Hence, this sub-case does not occur.
Case 3b: If , then we choose
(88) |
The assumption ensures that these quantities both lie in . Then, taking the derivative in (86), we obtain
(89) |
Since is convex (following from and being convex), this implies that must attain its global minimum at . In addition, since , their common value must equal , which further implies . ∎
We now relate to . Recall from Assumption 1 that and . If , then substituting these into (83) gives
(90) |
On the other hand, if , then Assumption 1 gives , and substitution into (83) gives
(91) |
Combining these two cases, we have for all that for one of the two choices , and hence, Lemma 8 implies there exists such that
(92) |
In view of the fact that (82) is the optimal error exponent, we have shown that there always exists such that the error exponent is upper bounded by .
To complete the proof, we also need to show that for all , there exist choices of the codewords and such that . This is much simpler, following directly by setting (i.e., choosing the all-zeros and all-ones codewords) in (83), setting , upper bounding both and by .
-C A Case Where Feedback Increases the 1-Hop Exponent
Let the first channel (used in the first round) be a BSC with a given crossover probability, and let the second channel (used in the second round) be a reverse Z-channel (RZ-channel for short) with a given parameter. Note that Assumption 1 holds for these channels, as discussed just above Lemma 5. Consider the following communication strategy with partial feedback:
•
Repeatedly input the message bit to the BSC during the first round of channel uses.
•
After observing the feedback, if more than half of the BSC bits were flipped (so that the initial majority-based decision will be wrong), send the constant “reverse” sequence over the RZ-channel in the remaining uses; otherwise, send the other constant sequence, which is the one received noiselessly.
At the decoder, an initial estimate is formed based on maximum-likelihood decoding (i.e., a majority vote) on the received BSC symbols. Then, after receiving the RZ-channel output, the decision is reversed if any symbols consistent only with the “reverse” sequence are observed; otherwise, the decision is kept. Essentially, the RZ-channel uses are used to tell the decoder whether the initial decision should be changed or not.
If the initial decision is correct, then the final decision is correct with conditional probability one, since the “keep” sequence is received noiselessly. On the other hand, if the initial decision is incorrect, then the probability that it fails to be reversed decays exponentially in the number of second-round uses. Hence, an error only occurs if both steps fail, and the overall error exponent of this strategy is given by
(93) |
where is the optimal error exponent for transmitting two messages over a BSC (see (5)).
In Figure 5, we compare the exponent in (93) to that without feedback (Lemma 5), with the channel parameters chosen so that the optimal 1-hop exponents of the two channels are identical. In this example, we see that feedback strictly increases the exponent for all intermediate values of the time-splitting parameter. Intuitively, the improvement comes from being able to “reverse” initial “bad events” to turn them into good events, which is not possible in the absence of feedback.

-D Proof of Lemma 6 (Tightness for Symmetric )
Throughout the proof, we let be a shorthand for , and similarly for and other analogous quantities.
We make use of Theorem 3, and accordingly consider a (feedback) communication scenario with uses of and uses of . Since we assume to be equiprobable on , the optimal decoding rule given a received sequence is to set if , and otherwise. Under this optimal rule, the error probability is given by [10, Eq. (3.2)]
(94) |
We momentarily consider the case that there is no feedback, in which we can simply define and to be the two length- codewords (one per message), and choose these codewords to minimize the error probability, yielding
(95) |
Since this expression is still exactly equal to smallest possible error probability, its corresponding exponent is , defined analogously to but without feedback.
In the two-round setting with feedback considered in Theorem 3, slightly more care is needed. We proceed by explicitly splitting up the codeword and received sequence as and , and writing the following for :
(96) |
where explicitly writes as a function of and , and similarly for . Substituting into (94), we obtain
(97) | |||
(98) |
We now observe that this expression coincides with the non-feedback case given in (95), except that the minimum over comes after the summation over (since the former is allowed to depend on the latter).
While this ordering can be significant in general, we now show that it has no effect when is a BSC. Without loss of optimality, we can assume that and differ in every entry, since any entries that coincide would be independent of (at least conditionally given ) and reveal no information (or viewed differently, could be simulated at the decoder). However, when is a BSC, its symmetry implies that any quantity of the form
(99) |
(with constants and ) is identical for all such pairs; the precise choice only amounts to re-ordering terms in the summation over . It follows that interchanging the order of and in (98) has no impact, and we recover the non-feedback expression (95), whose error exponent is defined to be .
It only remains to show that the error exponent is no higher than the learning rate in (45). To show this, we re-use some of the findings from Appendix -B (note that Assumption 1 was not applied until after Lemma 8 in Appendix -B, and we do not use it here). In particular, the error exponent is still given by (82), and (83) still holds, assuming without loss of optimality that the two codewords differ in every entry.
For any fixed , we can make in (83) smaller or equal by choosing both and to be either zero or one (e.g., if then we set ). That is, without loss of optimality, every input to is identical, and every input to is identical. Assuming without loss of generality that , and noting that the symmetry of the BSC implies , it follows that we can simplify (83) to
(100) |
This precisely coincides with the quantity introduced in (86), and hence, we can directly apply Lemma 8 to obtain for some and all , where is defined in (84). Since coincides with (45), this completes the proof.
-E Proof of Corollary 2 (Strict Improvement in the Converse)
To see that the converse in Theorem 3 can strictly improve on the trivial one in (67), we re-use the example shown in Figure 5 in Appendix -C, but with the two channels interchanged (i.e., the teacher's observation channel is now the reverse Z-channel, and the teacher-student channel is the BSC). In this case, it is straightforward to verify that the “No Feedback” curve in Figure 5 is simply flipped about the mid-point of the time-splitting parameter's range, and in particular, all intermediate values still give a strictly smaller exponent than the two endpoints.
The key difference due to interchanging the two channels is that, in view of the above proof of Lemma 6, the feedback no longer helps, and hence, the same curve just discussed serves as a valid converse even with the single round of feedback. Hence, by Theorem 3 and the fact that the simple converse in (67) corresponds to taking the better of the two endpoints, we conclude that a strict improvement has indeed been shown.
While the reliance of the preceding proof on a numerical evaluation can be circumvented, we omit such an approach, since we believe that this example has no ambiguity with respect to potential numerical issues.
-F Proof of Theorem 4 (Converse for General Block-Structured Protocols)
We first consider a single block. Letting denote the symbols received by the teacher, denote the transmitted bits, and denote the block received by the student (here we omit the vector notation to lighten the exposition), we have the Markov chain .
We begin with an inequality on mixture distributions (cf., Definition 1). Let and be probability distributions, and suppose follows a mixture distribution described by . Then, observe that
(101) | ||||
(102) | ||||
(103) |
where (102) applies Jensen’s inequality to the concave function (), and (103) lower bounds the average by the minimum. The same argument applies when is fixed and is a mixture.
For convenience, define . Considering two length- sequences , by the additive property of (i.e., multiplicative property of ) for product distributions, the corresponding conditional distributions for satisfy
(104) |
where is the number of indices where is zero and is one, and vice versa for .
Supposing first that , we can use the assumption from Assumption 1, and weaken (104) to
(105) |
since . Observe that (105) gives a lower bound that holds uniformly with respect to . Then, using the property (103) and the fact that and are mixture distributions mixed by , we see that (105) implies (for ) that
(106) |
This gives us a lower bound involving the first channel, and the lower bound involving the second channel turns out to be simpler: using the Markov chain above, along with the data processing inequality for the quantity in question (this data processing inequality is equivalent to that for Rényi divergence; see for example [12, Example 2]), we have
(107) |
Combining the lower bounds, we have for all that
(108) |
If , then Assumption 1 gives , and the analog of (106) is
(109) |
Since (107) holds for all , it follows that for any , we have
(110) |
Combining the cases and , we deduce that
(111) |
Having studied a single block, we are now in a position to study the overall combination of blocks (the first block received by the student does not depend on the unknown bit, and provides no distinguishing power), defining the multi-letter version of the relevant quantity as
(112) |
where is the entire length- sequence received by the student. Assuming momentarily that the teacher employs an identical strategy within each block, the additive property of for product distributions (arising from and being memoryless) gives
(113) |
and substituting (111) gives
(114) |
In fact, this lower bound holds even if the teacher applies a different strategy between blocks, since we showed that (111) holds regardless of the strategy used within the block.
Letting , we have from [10, Thm. 5] (a more general form of Lemma 2) that, for any decoding rule employed by the student, the error probability is lower bounded by
(115) |
where denotes the second derivative when , and the appropriate limiting value when . Again using the additive property of , the following holds if the teacher employs the same per-block strategy in each block:
(116) |
Once again, our analysis also extends immediately to the case of varying per-block strategies.
In Lemma 9 below, we show that , and substitution into (116) gives . The assumption then yields , or . Substituting this scaling into (115), applying (114), and taking suitable limits, we obtain
(117) |
We deduce Theorem 4 by further upper bounding the right-hand side by expanding the minimum from to ; (117) additionally reveals that further restricting is without loss of optimality, which is unsurprising given Assumption 1.
It only remains to show the following.
Lemma 9.
For any fixed and , and any , it holds that .
Proof.
To reduce notation, define and . It is shown in [10, p. 85] that can be written as the variance of a log-likelihood ratio:
(118) |
where is distributed according to the following “tilted” distribution when :
(119) |
In addition, when equals an endpoint (i.e., 0 or 1), we can use the same expression for , except that must be restricted to satisfy both and (with otherwise) [10].
We upper bound the variance by the second moment, i.e., , and proceed by further upper bounding the latter. By the definition of , any such that or does not contribute to the second moment. On the other hand, for , if , then we have
(120) | ||||
(121) |
where is the -fold product of (and similarly for ), is the transmitted sequence when the teacher receives , and are the smallest non-zero transition probabilities of and . It follows that each non-zero is bounded between and , which implies that , and hence .
∎
Acknowledgment
This work was supported by the Singapore National Research Foundation (NRF) under grant number R-252-000-A74-281.
References
- [1] V. Jog and P. L. Loh, “Teaching and learning in uncertainty,” IEEE Transactions on Information Theory, vol. 67, no. 1, pp. 598–615, 2021.
- [2] W. Huleihel, Y. Polyanskiy, and O. Shayevitz, “Relaying one bit across a tandem of binary-symmetric channels,” IEEE International Symposium on Information Theory (ISIT), 2019.
- [3] S. Rajagopalan and L. Schulman, “A coding theorem for distributed computation,” ACM Symposium on Theory of Computing, 1994.
- [4] X. Vives, “How fast do rational agents learn?” The Review of Economic Studies, vol. 60, no. 2, pp. 329–347, 1993.
- [5] A. Jadbabaie, P. Molavi, and A. Tahbaz-Salehi, “Information heterogeneity and the speed of learning in social networks,” Columbia Business School Research Paper, no. 13-28, 2013.
- [6] P. Molavi, A. Tahbaz-Salehi, and A. Jadbabaie, “Foundations of non-bayesian social learning,” Columbia Business School Research Paper, no. 15-95, 2017.
- [7] M. Harel, E. Mossel, P. Strack, and O. Tamuz, “Rational Groupthink,” The Quarterly Journal of Economics, vol. 136, no. 1, pp. 621–668, 2020.
- [8] A. El Gamal and Y.-H. Kim, Network information theory. Cambridge university press, 2011.
- [9] S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence. OUP Oxford, 2013.
- [10] C. Shannon, R. Gallager, and E. Berlekamp, “Lower bounds to error probability for coding on discrete memoryless channels. I,” Information and Control, vol. 10, no. 1, pp. 65–103, 1967.
- [11] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems, 2nd ed. Cambridge University Press, 2011.
- [12] T. van Erven and P. Harremos, “Rényi divergence and Kullback-Leibler divergence,” IEEE Transactions on Information Theory, vol. 60, no. 7, pp. 3797–3820, 2014.
Yan Hao Ling received the B.Comp. degree in computer science and the B.Sci. degree in mathematics from the National University of Singapore (NUS) in 2021. He is now a PhD student in the Department of Computer Science at NUS. His research interests are in the areas of information theory, statistical learning, and theoretical computer science.
Jonathan Scarlett (S’14 – M’15) received the B.Eng. degree in electrical engineering and the B.Sci. degree in computer science from the University of Melbourne, Australia. From October 2011 to August 2014, he was a Ph.D. student in the Signal Processing and Communications Group at the University of Cambridge, United Kingdom. From September 2014 to September 2017, he was a post-doctoral researcher with the Laboratory for Information and Inference Systems at the École Polytechnique Fédérale de Lausanne, Switzerland. Since January 2018, he has been an assistant professor in the Department of Computer Science and Department of Mathematics, National University of Singapore. His research interests are in the areas of information theory, machine learning, signal processing, and high-dimensional statistics. He received the Singapore National Research Foundation (NRF) fellowship, and the NUS Early Career Research Award.