New identities for the Shannon-Wiener entropy function with applications
Abstract
We show how the two-variable entropy function $H(p, q)$, $q = 1 - p$, can be expressed as a linear combination of entropy functions, symmetric in $p$ and $q$, whose arguments are quotients of polynomials in $p$ and $q$ of degree $n$, for any $n \ge 2$.
Keywords: entropy, Binary Symmetric Channel, Shannon, Wiener, identities
2020 MSC: 94A15 Information theory, 94A17 Measures of information, entropy, 94A60 Cryptography
1 Introduction
In Renyi's A Diary on Information Theory [8], he points out that the entropy formula, i.e.,
$$H = -\sum_{i=1}^{n} p_i \log p_i,$$
where logs are to the base 2, was arrived at, independently, by Claude Shannon and Norbert Wiener in 1948. This famous formula was the revolutionary precursor of the information age. Renyi goes on to say that
this formula had already appeared in the work of Boltzmann, which is why it is also called the Boltzmann-Shannon formula. Boltzmann arrived at this formula in connection with a completely different problem. Almost half a century before Shannon he gave essentially the same formula to describe entropy in his investigations of statistical mechanics. He showed that if, in a gas containing a large number of molecules, the probabilities of the possible states of the individual molecules are $p_1, p_2, \ldots, p_N$, then the entropy of the system is $c \sum_{k=1}^{N} p_k \log \frac{1}{p_k}$, where $c$ is a constant. (In statistical mechanics the natural logarithm is used and not base 2 …). The entropy of a physical system is the measure of its disorder …
Since 1948 there have been many advances in information theory and entropy. A well-known paper of Dembo, Cover and Thomas is devoted to inequalities in information theory. Here, we concentrate on equalities. We show how a Shannon function $H(p, 1-p)$ can be expanded in infinitely many ways in an infinite series of functions, each of which is a linear combination of Shannon functions of the type $H(u, 1-u)$, where the arguments $u$ are quotients of polynomials of degree $n$ in $p$ and $1-p$, for any $n \ge 2$. Apart from its intrinsic interest, this new result gives insight into the algorithm in Section 6 for constructing a common secret key between two communicating parties; see also [2], [3], [4], [6].
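Calculations with the Shannon formula are easy to reproduce numerically. The following minimal Python sketch (ours, not part of the original text; the function name `entropy` is our own choice) computes $H$ for any finite distribution and is the only ingredient needed to check the identities derived below.

```python
import math

def entropy(probs, base=2):
    """Shannon-Wiener entropy H = -sum p_i log p_i, with 0 log 0 taken as 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # a fair coin carries 1 Shannon bit
print(entropy([0.9, 0.1]))   # a biased coin carries about 0.469 bits
```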
2 Extensions of a Binary Symmetric Channel
Recall that a binary symmetric channel $B$ has input and output symbols drawn from $\{0, 1\}$. We say that there is a common probability $q$ of any symbol being transmitted incorrectly, independently for each transmitted symbol, and we set $p = 1 - q$.
We use the channel matrix
$$\begin{pmatrix} p & q \\ q & p \end{pmatrix}.$$
Again, $p$ is the probability of success, i.e., $p$ denotes the probability that 0 is transmitted to 0 and also the probability that 1 gets transmitted to 1. The second extension $B^2$ of $B$ has alphabet $\{00, 01, 10, 11\}$ and channel matrix
$$\begin{pmatrix} p^2 & pq & qp & q^2 \\ pq & p^2 & q^2 & qp \\ qp & q^2 & p^2 & pq \\ q^2 & qp & pq & p^2 \end{pmatrix}.$$
An alternative way to think of the $n$-th extension $B^n$ of a channel $B$ (see Welsh [9]) is to regard it as $n$ copies of $B$ acting independently and in parallel. See Figure 1.

Let us assume also that, for $B^n$, the input probability of 0 and the input probability of 1 are both equal to $\tfrac{1}{2}$.
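The "independent parallel copies" description of $B^n$ is straightforward to simulate. The sketch below (our illustration; the parameter $p$ is the success probability introduced above) sends uniformly random input blocks through $n$ independent copies of $B$ and estimates the per-symbol success rate.

```python
import random

def bsc_extension(block, p):
    """Pass each bit of the block through an independent BSC with success probability p."""
    return [b if random.random() < p else 1 - b for b in block]

def estimate_success_rate(n=2, p=0.7, trials=100_000):
    agree = 0
    for _ in range(trials):
        x = [random.randint(0, 1) for _ in range(n)]      # uniform input block
        y = bsc_extension(x, p)
        agree += sum(xi == yi for xi, yi in zip(x, y))
    return agree / (trials * n)

print(estimate_success_rate())   # close to p = 0.7
```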
Theorem 2.1.
Let $X = (X_1, \ldots, X_n)$ and $Y = (Y_1, \ldots, Y_n)$ denote an input-output pair for $B^n$. Then
(a) $H(X) = n$.
(b) $H(Y) = n$.
(c) $H(X \mid Y)$ is equal to $nH(p, q)$, where $H(p, q) = -p\log p - q\log q$.
(d) The capacity of $B^n$ is $n(1 - H(p, q))$.
Proof.
(a) Since, by definition, the $X_i$ are independent, $\Pr(X = x) = \prod_{i=1}^{n} \Pr(X_i = x_i)$ for each binary string $x = (x_1, \ldots, x_n)$. We have $\Pr(X_i = x_i) = \tfrac12$, so $\Pr(X = x) = 2^{-n}$. Thus $H(X) = \sum_{x} 2^{-n} \log 2^{n} = n$.
(b) The value of $Y_i$ only depends on $X_i$. The $X_i$ are independent. Thus the $Y_i$ are independent. For any $i$ we have $\Pr(Y_i = 0) = \Pr(X_i = 0)\,p + \Pr(X_i = 1)\,q = \tfrac12(p + q) = \tfrac12$. Also $\Pr(Y_i = 1) = \tfrac12$. Then, as for (a), $\Pr(Y = y) = 2^{-n}$ for each binary string $y$. Thus $H(Y) = n$.
(c) We have $H(X, Y) = H(Y) + H(X \mid Y) = H(X) + H(Y \mid X)$. Since $H(X) = H(Y) = n$ we have $H(X \mid Y) = H(Y \mid X)$. Now
$$H(Y \mid X) = \sum_{x} \Pr(X = x)\, H(Y \mid X = x),$$
where $x$ denotes a given value of the random vector $X$. Since the channel is memoryless,
$$H(Y \mid X = x) = nH(p, q).$$
The last step needs a little work; see [5] Exercise 4.10 or [7] or [3] or the example below for details. Then $H(Y \mid X) = nH(p, q)$. Thus $H(X \mid Y) = nH(p, q)$.
Example. Let $n = 2$ and $x = 00$. Then $y = 00$ or $01$ or $10$ or $11$. Using independence we get that the corresponding entropy term is $H(Y \mid X = 00) = -(p^2 \log p^2 + pq \log pq + qp \log qp + q^2 \log q^2)$. This simplifies to $2H(p, q)$. Note that the probability that $X = 00$ is $\tfrac14$.
(d) The capacity of $B^n$ is the maximum value, over all input distributions, of $I(X; Y) = H(X) - H(X \mid Y)$. Since $X$ is random, the input probability of a 1 or 0 is 0.5. This input distribution maximizes $I(X; Y)$ for $B^n$, the maximum value being $n - nH(p, q)$. Then the capacity of $B^n$ is $n(1 - H(p, q))$. It represents the information about $X$ conveyed by $Y$, or the amount of information about $Y$ conveyed by $X$.
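Parts (c) and (d) can be checked numerically by building the exact joint distribution of an input-output pair for $B^n$ with uniform inputs. This is our own sanity check, not part of the proof.

```python
import math
from itertools import product

def H(probs):
    return -sum(pr * math.log2(pr) for pr in probs if pr > 0)

def check_extension(n=2, p=0.7):
    q = 1 - p
    blocks = list(product([0, 1], repeat=n))
    # Joint distribution Pr(X = x, Y = y) with uniform inputs.
    joint = {(x, y): (1 / 2**n) * math.prod(p if xi == yi else q for xi, yi in zip(x, y))
             for x in blocks for y in blocks}
    HX = H([1 / 2**n] * 2**n)                                # = n
    HY = H([sum(pr for (x, y), pr in joint.items() if y == y0) for y0 in blocks])
    H_X_given_Y = H(joint.values()) - HY                     # H(X|Y) = H(X,Y) - H(Y)
    mutual_info = HX - H_X_given_Y                           # I(X;Y) at uniform input
    return H_X_given_Y, n * H([p, q]), mutual_info, n * (1 - H([p, q]))

print(check_extension())   # the first pair of numbers agree, and so does the second pair
```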
3 An Entropy Equality
First we need some additional discussion on entropy.
A. Extending a basic result.
A fundamental result for random variables $X$ and $Y$ is that $H(X, Y) = H(X) + H(Y \mid X)$. A corresponding argument may be used to establish similar identities involving more than two random variables. For example,
$$H(X, Y, Z) = H(X) + H(Y \mid X) + H(Z \mid X, Y).$$
Also
$$H(X, Y \mid Z) = H(X \mid Z) + H(Y \mid X, Z).$$
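These identities are easy to confirm on any concrete joint distribution; the short sketch below (ours; the joint table is arbitrary) checks the basic result $H(X, Y) = H(X) + H(Y \mid X)$.

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# An arbitrary joint distribution Pr(X = i, Y = j); row i corresponds to X = i.
joint = [[0.10, 0.25, 0.05],
         [0.20, 0.10, 0.30]]

HXY = H([p for row in joint for p in row])
HX = H([sum(row) for row in joint])
HY_given_X = sum(sum(row) * H([p / sum(row) for p in row]) for row in joint)  # sum_x Pr(x) H(Y|X=x)

print(abs(HXY - (HX + HY_given_X)) < 1e-12)   # True
```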
B. From Random Variables to Random Vectors.
For any random variable $X$ taking only a finite number of values $x_1, x_2, \ldots, x_n$ with probabilities $p_1, p_2, \ldots, p_n$ such that
$$p_i \ge 0, \qquad p_1 + p_2 + \cdots + p_n = 1,$$
we define the entropy of $X$ using the Shannon formula
$$H(X) = -\sum_{i=1}^{n} p_i \log p_i.$$
Analogously, if $V$ is a random vector which takes only a finite number of values $v_1, v_2, \ldots, v_m$, we define its entropy by the formula
$$H(V) = -\sum_{i=1}^{m} \Pr(V = v_i) \log \Pr(V = v_i).$$
For example, when $V = (X_1, X_2)$ is a two-dimensional random vector, say with $\Pr(X_1 = u_i, X_2 = w_j) = p_{ij}$, then we can write
$$H(V) = -\sum_{i, j} p_{ij} \log p_{ij}.$$
Note that $H(V) = H(X_1, X_2)$, the joint entropy of $X_1$ and $X_2$.
More generally, if $X_1, X_2, \ldots, X_n$ is a collection of random variables each taking only a finite set of values, then we can regard $V = (X_1, X_2, \ldots, X_n)$ as a random vector taking a finite set of values and we define the joint entropy of $X_1, X_2, \ldots, X_n$ by
$$H(X_1, X_2, \ldots, X_n) = H(V).$$
Standard results for random variables then carry over to random vectors — see [1], [9].
C. The Grouping Axiom for Entropy.
This axiom or identity can shorten calculations. It reads as follows ([9, p. 2], [1, p. 8], [3, Section 9.6]).
Let $a = a_1 + a_2 + \cdots + a_m$ and $b = b_1 + b_2 + \cdots + b_n$, where each $a_i$ and $b_j$ is non-negative. Assume that $a$ and $b$ are positive with $a + b = 1$. Then
$$H(a_1, \ldots, a_m, b_1, \ldots, b_n) = H(a, b) + a\, H\!\left(\frac{a_1}{a}, \ldots, \frac{a_m}{a}\right) + b\, H\!\left(\frac{b_1}{b}, \ldots, \frac{b_n}{b}\right).$$
For example, suppose $m = 1$, so that $a = a_1$. Then we get
$$H(a_1, b_1, \ldots, b_n) = H(a, b) + b\, H\!\left(\frac{b_1}{b}, \ldots, \frac{b_n}{b}\right).$$
This is because $H\!\left(\frac{a_1}{a}\right) = H(1) = 0$.
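A quick numerical check of the grouping axiom (our sketch; the particular grouping is arbitrary):

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

a_parts = [0.1, 0.2, 0.1]        # a = 0.4
b_parts = [0.15, 0.25, 0.2]      # b = 0.6
a, b = sum(a_parts), sum(b_parts)

lhs = H(a_parts + b_parts)
rhs = H([a, b]) + a * H([x / a for x in a_parts]) + b * H([x / b for x in b_parts])
print(abs(lhs - rhs) < 1e-12)    # True
```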
Theorem 3.1.
Let $X$, $Y$, $Z$ be random vectors such that $H(Z \mid X, Y) = 0$. Then
(a) $H(X, Z \mid Y) = H(X \mid Y)$.
(b) $H(X \mid Y) = H(Z \mid Y) + H(X \mid Y, Z)$.

Proof.
For (b),
$$H(X \mid Y) = H(X, Z \mid Y) = H(Z \mid Y) + H(X \mid Y, Z),$$
using part (a) and the conditional identity in A above.
4 The New Identities
We will use the above identity, i.e.,
$$H(X \mid Y) = H(Z \mid Y) + H(X \mid Y, Z), \tag{4.1}$$
which holds under the assumption that $H(Z \mid X, Y) = 0$. We begin with arrays $X = (x_1, \ldots, x_n)$, $Y = (y_1, \ldots, y_n)$, where $n$ is even. We assume that $X$, $Y$ are random binary strings subject to the condition that, for each $i$, $\Pr(x_i = y_i) = p$. We also assume that the events $\{x_i = y_i\}$, $1 \le i \le n$, form an independent set. We divide $X$, $Y$ into blocks of size 2.
To start, put $X = (x_1, x_2)$, $Y = (y_1, y_2)$, $Z = x_1 + x_2 \pmod 2$.
Lemma 4.2.
$H(Z \mid X, Y) = 0$.

Proof.
We want to calculate $H(Z \mid X, Y)$. Given $X$, say $X = (x_1, x_2)$, the value of $Z$ is $x_1 + x_2 \pmod 2$. There is no uncertainty in the value of $Z$ given $X$, i.e., each term in the sum defining $H(Z \mid X, Y)$ is zero. Therefore $H(Z \mid X, Y) = 0$.
From this we can use formula (4.1) for this block of size two. We can think of a channel from $Y$ to $X$ (or from $X$ to $Y$) which is the second extension of a binary symmetric channel where $p$ is the probability of success. We have
$$H(X \mid Y) = H(Z \mid Y) + H(X \mid Y, Z).$$
From Theorem 2.1 part (c), the left side, i.e., $H(X \mid Y)$, is equal to $2H(p, q)$. Next we calculate the right side beginning with the first term, i.e., $H(Z \mid Y)$. We have
$$H(Z \mid Y) = H(Z) = H(p^2 + q^2,\, 2pq).$$
We know that $Z = x_1 + x_2 = (y_1 + y_2) + (x_1 + y_1) + (x_2 + y_2) \pmod 2$, and the error indicators $x_1 + y_1$, $x_2 + y_2$ are independent of $Y$ since the input is uniform. From the standard formula we have $\Pr(x_1 + x_2 = y_1 + y_2) = p^2 + q^2$ and $\Pr(x_1 + x_2 \neq y_1 + y_2) = 2pq$, since the events $\{x_1 = y_1\}$ and $\{x_2 = y_2\}$ are independent.
Next we calculate
$$H(X \mid Y, Z).$$
Again we have two possibilities, i.e., $Z = y_1 + y_2$ and $Z \neq y_1 + y_2$. The corresponding probabilities are $p^2 + q^2$ and $2pq$ respectively. We obtain
$$H(X \mid Y, Z) = (p^2 + q^2)\, H\!\left(\frac{p^2}{p^2 + q^2},\, \frac{q^2}{p^2 + q^2}\right) + 2pq.$$
This comes about from the facts that
(a) If $Z = y_1 + y_2$ then we either have $x_1 = y_1$ and $x_2 = y_2$ or $x_1 \neq y_1$, $x_2 \neq y_2$.
(b) If $Z \neq y_1 + y_2$ then either $x_1 = y_1$ and $x_2 \neq y_2$ or $x_1 \neq y_1$ and $x_2 = y_2$.
(c) In case (a) the two possibilities have conditional probabilities $\frac{p^2}{p^2 + q^2}$ and $\frac{q^2}{p^2 + q^2}$; in case (b) each possibility has conditional probability $\frac{pq}{2pq} = \frac12$, so the conditional entropy is $H(\tfrac12, \tfrac12) = 1$.
Then from equation (4.1) we have our first identity as follows:
$$2H(p, q) = H(p^2 + q^2,\, 2pq) + (p^2 + q^2)\, H\!\left(\frac{p^2}{p^2 + q^2},\, \frac{q^2}{p^2 + q^2}\right) + 2pq. \tag{4.3}$$
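Identity (4.3) is easy to verify numerically; the following sketch (ours) checks it for several values of $p$.

```python
import math

def H(*probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def check_43(p):
    q = 1 - p
    s = p * p + q * q                  # probability that the two parities agree
    lhs = 2 * H(p, q)
    rhs = H(s, 2 * p * q) + s * H(p * p / s, q * q / s) + 2 * p * q
    return abs(lhs - rhs) < 1e-12

print(all(check_43(p) for p in (0.55, 0.7, 0.9, 0.99)))   # True
```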
Blocks of Size Three.
Here $X = (x_1, x_2, x_3)$, $Y = (y_1, y_2, y_3)$, $Z = x_1 + x_2 + x_3 \pmod 2$. As in Lemma 4.2 we have $H(Z \mid X, Y) = 0$, so we can use formula (4.1) again, i.e.,
$$H(X \mid Y) = H(Z \mid Y) + H(X \mid Y, Z).$$
We have a channel from $Y$ to $X$ (or from $X$ to $Y$) which is the third extension of a binary symmetric channel $B$, where $p$ is the probability that 0 (or 1) is transmitted to itself.
From Theorem 2.1 we have $H(X \mid Y) = 3H(p, q)$.
Similar to the case of blocks of size 2, we have $H(Z \mid Y) = H(Z) = H(p^3 + 3pq^2,\, 3p^2q + q^3)$. This is because the probabilities that $x_1 + x_2 + x_3 = y_1 + y_2 + y_3$ or $x_1 + x_2 + x_3 \neq y_1 + y_2 + y_3$ are, respectively, $p^3 + 3pq^2$ or $3p^2q + q^3$, as follows.
If $x_1 + x_2 + x_3 = y_1 + y_2 + y_3$, then either $x_i = y_i$ for $i = 1, 2, 3$, or else, for some $i$, $x_i = y_i$ (3 possibilities) and, for the other two indices $j$, $k$, $x_j \neq y_j$ and $x_k \neq y_k$.
A similar analysis can be carried out for the case where $x_1 + x_2 + x_3 \neq y_1 + y_2 + y_3$. We then get
$$3H(p, q) = H(p^3 + 3pq^2,\, 3p^2q + q^3) + (p^3 + 3pq^2)\,A + (3p^2q + q^3)\,B,$$
where
$$A = H\!\left(\frac{p^3}{p^3 + 3pq^2},\, \frac{pq^2}{p^3 + 3pq^2},\, \frac{pq^2}{p^3 + 3pq^2},\, \frac{pq^2}{p^3 + 3pq^2}\right), \qquad B = H\!\left(\frac{p^2q}{3p^2q + q^3},\, \frac{p^2q}{3p^2q + q^3},\, \frac{p^2q}{3p^2q + q^3},\, \frac{q^3}{3p^2q + q^3}\right).$$
We now use the grouping axiom for $A$. The $a_1$ in the formula refers to $\frac{p^3}{p^3 + 3pq^2}$ here, and the $b_1, \ldots, b_n$ there are now replaced by the three equal terms $\frac{pq^2}{p^3 + 3pq^2}$. Then
$$A = H\!\left(\frac{p^3}{p^3 + 3pq^2},\, \frac{3pq^2}{p^3 + 3pq^2}\right) + \frac{3pq^2}{p^3 + 3pq^2}\,\log 3.$$
$B$ is obtained by interchanging $p$ with $q$. We note that, since $p + q = 1$, $(p^3 + 3pq^2) + (3p^2q + q^3) = (p + q)^3 = 1$.
From working with blocks of size 3 we get
$$3H(p, q) = H(p^3 + 3pq^2,\, 3p^2q + q^3) + (p^3 + 3pq^2)\left[H\!\left(\frac{p^3}{p^3 + 3pq^2},\, \frac{3pq^2}{p^3 + 3pq^2}\right) + \frac{3pq^2}{p^3 + 3pq^2}\log 3\right] + (3p^2q + q^3)\left[H\!\left(\frac{3p^2q}{3p^2q + q^3},\, \frac{q^3}{3p^2q + q^3}\right) + \frac{3p^2q}{3p^2q + q^3}\log 3\right]. \tag{4.4}$$
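As with (4.3), identity (4.4) can be checked numerically (our sketch):

```python
import math

def H(*probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def check_44(p):
    q = 1 - p
    a = p**3 + 3 * p * q**2            # probability that the block parities agree
    b = 3 * p**2 * q + q**3            # probability that they disagree
    A = H(p**3 / a, 3 * p * q**2 / a) + (3 * p * q**2 / a) * math.log2(3)
    B = H(3 * p**2 * q / b, q**3 / b) + (3 * p**2 * q / b) * math.log2(3)
    return abs(3 * H(p, q) - (H(a, b) + a * A + b * B)) < 1e-12

print(all(check_44(p) for p in (0.6, 0.7, 0.9)))   # True
```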
For blocks of size 2, formula (4.3) can be put in a more compact form in terms of capacities. Writing $C(u, v) = 1 - H(u, v)$ for the capacity of a binary symmetric channel with success probability $u$ (and $v = 1 - u$), we obtain
$$2\,C(p, q) = C(p^2 + q^2,\, 2pq) + (p^2 + q^2)\, C\!\left(\frac{p^2}{p^2 + q^2},\, \frac{q^2}{p^2 + q^2}\right). \tag{4.5}$$
Using the same method we can find a formula analogous to formulae (4.3), (4.4) for blocks of any size $n$, obtaining $nH(p, q)$ as a linear combination of terms of the form $H(u, 1 - u)$, where the arguments $u$ are quotients of terms in $p$ and $q$ of degree $n$, plus extra terms such as the multiples of $\log 3$ in formula (4.4).
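In the same spirit, the capacity form (4.5) can be checked directly; in the sketch below (ours) we write $C(u) = 1 - H(u, 1-u)$ for the capacity of a binary symmetric channel with success probability $u$.

```python
import math

def H(u):
    return 0.0 if u in (0.0, 1.0) else -(u * math.log2(u) + (1 - u) * math.log2(1 - u))

def C(u):
    """Capacity of a binary symmetric channel with success probability u."""
    return 1 - H(u)

def check_45(p):
    q = 1 - p
    s = p * p + q * q
    return abs(2 * C(p) - (C(s) + s * C(p * p / s))) < 1e-12

print(all(check_45(p) for p in (0.55, 0.7, 0.9)))   # True
```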
5 Generalizations, an Addition Formula
The result of Theorem 2.1 can be extended to the more general case where we take the product of binary symmetric channels even if the channel matrices are different, corresponding to differing values of the success probability.
As an example, suppose we use the product of 2 binary symmetric channels with channel matrices
$$\begin{pmatrix} p_1 & q_1 \\ q_1 & p_1 \end{pmatrix}, \qquad \begin{pmatrix} p_2 & q_2 \\ q_2 & p_2 \end{pmatrix}, \qquad q_1 = 1 - p_1, \; q_2 = 1 - p_2.$$
Then the argument in Section 4 goes through. To avoid being overwhelmed by symbols we make a provisional notation change.
Notation 5.1.
We denote by $h(x)$ the quantity $H(x, 1 - x)$.
Then we arrive at the following addition formula:
$$h(p_1) + h(p_2) = h(p_1p_2 + q_1q_2) + (p_1p_2 + q_1q_2)\, h\!\left(\frac{p_1p_2}{p_1p_2 + q_1q_2}\right) + (p_1q_2 + q_1p_2)\, h\!\left(\frac{p_1q_2}{p_1q_2 + q_1p_2}\right). \tag{5.2}$$
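Under the reading of Notation 5.1 and formula (5.2) given above (which involves some reconstruction on our part), the addition formula can be verified numerically as follows.

```python
import math

def h(x):
    """h(x) = H(x, 1-x) in Shannon bits."""
    return 0.0 if x in (0.0, 1.0) else -(x * math.log2(x) + (1 - x) * math.log2(1 - x))

def check_52(p1, p2):
    q1, q2 = 1 - p1, 1 - p2
    a = p1 * p2 + q1 * q2              # probability that the two parities agree
    b = p1 * q2 + q1 * p2              # probability that they disagree
    lhs = h(p1) + h(p2)
    rhs = h(a) + a * h(p1 * p2 / a) + b * h(p1 * q2 / b)
    return abs(lhs - rhs) < 1e-12

print(all(check_52(p1, p2) for p1, p2 in [(0.7, 0.8), (0.9, 0.6), (0.99, 0.75)]))   # True
```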
Similarly to the above we can derive a formula for the product of three or more binary symmetric channels.
6 The Shannon Limit and Applications to Cryptography
The above method of using blocks of various sizes is reminiscent of the algorithm for the key exchange in [4] which relates to earlier work in [2], [6] and others. Indeed the identities above were informed by the details of the algorithm.
The algorithm starts with two arrays $X = (x_1, \ldots, x_n)$ and $Y = (y_1, \ldots, y_n)$. We assume that the set of events $\{x_i = y_i\}$ is an independent set with $\Pr(x_i = y_i) = p > \tfrac12$. We subdivide $X$, $Y$ into corresponding sub-blocks of size $t$, where $t$ divides $n$. Exchanging parities by public discussion we end up with new shorter sub-arrays $X_1$, $Y_1$, where the probabilities of corresponding entries being equal are independent with probability $p_1 > p$. Eventually, after several iterations, we end up with a common secret key.
Let us take an example. Start with two binary arrays $X$ and $Y$ of length $n$ with $n$ even, $n = 100$, say. We subdivide the arrays into corresponding blocks of size 2. If two corresponding blocks are $(x_1, x_2)$ and $(y_1, y_2)$ we discard those blocks if the parities disagree, i.e., if $x_1 + x_2 \neq y_1 + y_2 \pmod 2$. If the parities agree, which happens with probability $p^2 + q^2$, we keep $x_1$ and $y_1$, discarding $x_2$ and $y_2$. Thus, on average, we keep $\frac{n}{2}(p^2 + q^2)$ partial blocks and discard $\frac{n}{2}\,(2pq)$ blocks of size 2.
Let us suppose $p = 0.7$ and $q = 0.3$. From Theorem 2.1 part (d), the information that $Y$ has about $X$, i.e., that $X$ has about $Y$, is $n(1 - H(p, q)) = 100(1 - H(0.7, 0.3)) \approx 11.87$ Shannon bits.
We are seeking to find a sub-array of $X$ and the corresponding sub-array of $Y$ such that corresponding bits are equal. Our method is to publicly exchange parities. The length of this secret key will be at most 11.
Back to the algorithm. The two parties keep, on average, $\frac{100}{2}(p^2 + q^2)$ blocks of size 2, i.e., $50 \times 0.58 = 29$ blocks of size 2, and remove the bottom element of each block. We are left with 29 pairs of elements $(x_i, y_i)$. The probability that $y_i = x_i$, given that the parities agreed, is $\frac{p^2}{p^2 + q^2}$, i.e., $\frac{0.49}{0.58} \approx 0.845$. Next, $1 - H(0.845, 0.155) \approx 0.377$. To summarize, we started with 100 pairs with $\Pr(x_i = y_i) = 0.7$. The information revealed to $Y$ by $X$ is $100(1 - H(0.7, 0.3)) \approx 11.87$ Shannon bits of information. After the first step of the algorithm we are left with 29 pairs with $\Pr(x_i = y_i) \approx 0.845$. The amount of information revealed to the remnant of $Y$ by the remnant of $X$ is $29(1 - H(0.845, 0.155)) \approx 10.94$ Shannon bits of information. So we have "wasted" less than 1 bit, i.e. the wastage is about 8%. Mathematically we have
$$100\,(1 - H(0.7, 0.3)) = 29\,(1 - H(0.845, 0.155)) + \text{Wastage}.$$
In general we have
$$\frac{n}{2}\,\bigl(2\,C(p, q)\bigr) = \frac{n}{2}\,(p^2 + q^2)\, C\!\left(\frac{p^2}{p^2 + q^2},\, \frac{q^2}{p^2 + q^2}\right) + \frac{n}{2}\,W,$$
where $W$ denotes the wastage per block of size 2. Dividing by $\frac{n}{2}$ we get
$$2\,C(p, q) = (p^2 + q^2)\, C\!\left(\frac{p^2}{p^2 + q^2},\, \frac{q^2}{p^2 + q^2}\right) + W.$$
Comparing with formula (4.5) we see that $W = C(p^2 + q^2,\, 2pq)$. In this case $W = 1 - H(0.58, 0.42) \approx 0.019$, i.e., about $50 \times 0.019 \approx 0.93$ wasted bits in total.
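The first round of this example is easy to simulate. The sketch below (ours; the parameters $n = 100$ and $p = 0.7$ are those of the example) estimates the average number of kept pairs and the empirical agreement probability of the kept bits.

```python
import random

def first_round(n=100, p=0.7):
    """One pass of the block-size-2 parity exchange on arrays of length n."""
    x = [random.randint(0, 1) for _ in range(n)]
    y = [xi if random.random() < p else 1 - xi for xi in x]   # Pr(x_i = y_i) = p
    kept = []
    for i in range(0, n, 2):
        if (x[i] + x[i + 1]) % 2 == (y[i] + y[i + 1]) % 2:    # parities agree
            kept.append((x[i], y[i]))                         # keep the top element only
    return kept

runs = 20_000
total_kept = agreements = 0
for _ in range(runs):
    kept = first_round()
    total_kept += len(kept)
    agreements += sum(a == b for a, b in kept)

print(total_kept / runs)          # about 29 kept pairs per run on average
print(agreements / total_kept)    # about 0.49/0.58 = 0.845
```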
To sum up, the new identities tell us exactly how much information is wasted and not utilized. They also tell us, in conjunction with the algorithm, the optimum size of the sub-blocks at each stage.
One of the original motivations for work in coding theory was that Shannon's fundamental theorem showed that capacity is the bound for accurate communication; the problem was then to construct linear codes, or other codes such as turbo codes, that come close to that bound.
Here we have an analogous situation. In the example just considered, the maximum length of a cryptographic common secret key obtained as a common subset of $X$ and $Y$ is bounded by $100(1 - H(0.7, 0.3)) \approx 11.87$, so by 11 bits. The problem is to find algorithms which produce such a common secret key coming close to this Shannon bound of 11 bits.
This work nicely illustrates the inter-connections between codes, cryptography, and information theory. Information theory tells us the bound. The identities tell us the size of the sub-blocks for constructing a common secret key which attains, or gets close to, the information theory bound.
Coding theory is then used to ensure that the two communicating parties have a common secret key, by using the hash function described in the algorithm, based on a code $C$. Error correction can ensure that the common secret key can be obtained without using another round of the algorithm (thereby shortening the common key) if the difference between the keys of the two parties is less than the minimum distance of the dual code of $C$.
7 Concluding Remarks.
1. Please see Chapter 25 of the forthcoming Cryptography, Information Theory, and Error-Correction: A Handbook for the Twenty-First Century, by Bruen, Forcinito, and McQuillan, [3], for background information, additional details, and related material.
2. Standard tables for entropy list values to two decimal places. When $p$ is close to 1, interpolation to three decimal places is difficult, as $H$ is very steeply sloped near $p = 1$. Formula (4.3) may help, since $p$ is less than $\frac{p^2}{p^2 + q^2}$, and the formula can be re-iterated.
3.
4. The methods in this note suggest possible generalizations which we do not pursue here.
Acknowledgement: The author thanks Drs Mario Forcinito, James McQuillan and David Wehlau for their help and encouragement with this work.
References
- [1] R. B. Ash. Information Theory. Dover, 1990.
- [2] C. H. Bennett and G. Brassard. Quantum cryptography: Public key distribution and coin tossing. In International Conference on Computers, Systems & Signal Processing, December 1984.
- [3] A. A. Bruen, M. A. Forcinito, and J. M. McQuillan. Cryptography, Information Theory, and Error-Correction: A Handbook for the Twenty-First Century. Wiley, second edition, 2021. To appear.
- [4] A. A. Bruen, D. L. Wehlau, and M. Forcinito. Error correcting codes, block designs, perfect secrecy and finite fields. Acta Applicandae Mathematica, 93, September 2006.
- [5] G. A. Jones and J. M. Jones. Information and Coding Theory. Springer, 2000.
- [6] U. Maurer and S. Wolf. Privacy amplification secure against active adversaries. In Advances in Cryptology, Proceedings of Crypto’97, pages 307–321. Springer-Verlag, August 1997.
- [7] R. J. McEliece. The Theory of Information and Coding. Addison-Wesley, 1978.
- [8] A. Renyi. A Diary on Information Theory. Wiley, 1987.
- [9] D. Welsh. Codes and Cryptography. Oxford, 1988.