On Multi-Channel Huffman Codes for Asymmetric-Alphabet Channels
Abstract
Zero-error single-channel source coding has been studied extensively over the past decades. Its natural multi-channel generalization is, however, not well investigated. While the special case with multiple symmetric-alphabet channels was studied a decade ago, codes in that setting have no advantage over single-channel codes in data compression, which makes them of little use in most applications. After essentially no development since then, in this paper we break the stalemate by showing that it is possible to beat single-channel source codes in terms of compression when the channels have asymmetric alphabets. We present the multi-channel analog of several classical results in single-channel source coding, such as the fact that a multi-channel Huffman code is an optimal tree-decodable code. We also show some evidence that finding an efficient construction of multi-channel Huffman codes may be hard. Nevertheless, we propose a suboptimal code construction whose redundancy is guaranteed to be no larger than that of an optimal single-channel source code.
I Introduction
Zero-error source coding is one of the oldest branches of information theory. In a traditional (single-channel) source coding problem, a transmitter wishes to send an information source to a receiver with zero error through an error-free channel. To better utilize the channel, it is desirable to encode the information source in a way that minimizes the number of symbols transmitted through the channel while ensuring correct decoding. In technical terms, the goal is to minimize the expected codeword length of a source code for an information source. Given an arbitrary information source, the well-known Huffman procedure [1] can produce an optimal code known as a Huffman code. (In this paper, the optimality we consider is for symbol-by-symbol coding with known probability masses of the information source.) On the other hand, the entropy bound shows that the information-theoretic lower bound on the expected codeword length of a source code is the entropy of the information source. Despite the optimality of the Huffman code, it does not achieve the entropy bound in general.
As a natural generalization, zero-error multi-channel source coding was proposed in [2], where each source symbol is mapped to a codeword which spreads across multiple channels. This problem is more complicated than its single-channel counterpart, as the sequences of received symbols in the individual channels are not necessarily uniquely decodable, even though the overall source code is uniquely decodable. In [2], only the case where all channels use the same alphabet size, i.e., symmetric-alphabet channels, was investigated. Among many other results, [2] showed that a multi-channel code (for symmetric-alphabet channels) can be equated to a single-channel code, so that the existing tools of source coding can be applied to analyze the multi-channel code. However, this also implies that such a multi-channel code cannot improve compression. It is thus natural to ask: What if at least one channel has an alphabet size different from that of another channel, i.e., we have asymmetric-alphabet channels? Are there any new opportunities and challenges under this extended setting?
I-A Motivating Examples
Our starting point is the observation that, in some cases, a multi-channel code with asymmetric alphabets achieves better compression than an optimal single-channel code. Specifically, we consider the network in Fig. 1 and compare the expected codeword (description) lengths (the description length of a codeword is the number of information units, e.g., nats, required to represent it) of an optimal {2, 3}-ary tree-decodable code (a multi-channel prefix-free code might not have a decoding tree; we defer the detailed discussion of this issue to Section II-D), the binary Huffman code, and the ternary Huffman code for the information sources with probability masses (1/3, 1/3, 1/6, 1/6) and (1/2, 1/6, 1/6, 1/6) respectively in the following examples. For convenience, we use the information unit nat (base e, Euler's number) throughout this paper.
Example 1.
In Fig. 2(a), the codeword lengths for the 1/3 and 1/6 masses are ln 3 nats and ln 6 nats respectively. The expected codeword length of the {2, 3}-ary code is ln 3 + (1/3) ln 2 ≈ 1.3297 nats, which is the entropy of the information source. The expected codeword lengths of the corresponding binary and ternary Huffman codes are 2 ln 2 ≈ 1.3863 nats and (4/3) ln 3 ≈ 1.4648 nats respectively.
Example 2.
In Fig. 2(b), the codeword lengths for the 1/2 and 1/6 masses are ln 2 nats and ln 6 nats respectively. The expected codeword length of the {2, 3}-ary code is (1/2) ln 2 + (1/2) ln 6 ≈ 1.2425 nats, which is the entropy of the information source. The expected codeword lengths of the corresponding binary and ternary Huffman codes are (11/6) ln 2 ≈ 1.2708 nats and (4/3) ln 3 ≈ 1.4648 nats respectively.
TABLE I: Expected codeword lengths (in nats)
 | {2, 3}-ary | 2-ary | 3-ary
---|---|---|---
Fig. 2(a) | 1.32966134885 | 1.38629436112 | 1.46481638489
Fig. 2(b) | 1.24245332489 | 1.27076983103 | 1.46481638489
We summarize the above two examples and give the numerical values of the expected codeword lengths in Table I. We can see that the expected codeword length when both channels are in use is shorter than in the other two single-channel cases, i.e., the optimal multi-channel source code outperforms the optimal single-channel source codes.
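The entries of Table I can be reproduced with a short computation (a hedged sketch: the single-channel values come from the standard Huffman procedure, and the {2, 3}-ary values are the expected generalized depths of the decoding trees described in Examples 1 and 2).

```python
import heapq, math

def huffman_nats(masses, q):
    """Expected codeword length (in nats) of a q-ary single-channel Huffman code."""
    heap = list(masses)
    while (len(heap) - 1) % (q - 1) != 0:
        heap.append(0.0)                 # inject dummy zero-probability masses
    heapq.heapify(heap)
    acc = 0.0
    while len(heap) > 1:
        m = sum(heapq.heappop(heap) for _ in range(q))
        acc += m                         # each merge adds one q-ary symbol to its codewords
        heapq.heappush(heap, m)
    return acc * math.log(q)

p_a = [1/3, 1/3, 1/6, 1/6]               # source of Example 1
p_b = [1/2, 1/6, 1/6, 1/6]               # source of Example 2
# {2, 3}-ary lengths read off the trees of Fig. 2: (ln 3, ln 3, ln 6, ln 6) and (ln 2, ln 6, ln 6, ln 6)
L_a = (2/3) * math.log(3) + (1/3) * math.log(6)
L_b = 0.5 * math.log(2) + 0.5 * math.log(6)
print(L_a, huffman_nats(p_a, 2), huffman_nats(p_a, 3))   # 1.3297  1.3863  1.4648
print(L_b, huffman_nats(p_b, 2), huffman_nats(p_b, 3))   # 1.2425  1.2708  1.4648
```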
I-B Our Contributions
Given the above examples, it is natural to ask whether a multi-channel generalization of the Huffman procedure systematically produces an optimal code. We give a partial affirmative answer to this question. In particular, we formulate the generalized Huffman procedure and show that it indeed produces an optimal tree-decodable code. The generalized Huffman procedure follows the same idea of iteratively merging the smallest probability masses into bigger ones, until a single mass of probability 1 remains. However, unlike the single-channel case, there is more than one choice for the number of masses to be merged in each iteration. This results in an exponential number of potential merge sequences, and only choosing a "correct" merging step in every iteration results in an optimal code. It is therefore unclear how the generalized Huffman procedure can be computed in polynomial time.
Towards designing an algorithm which computes or approximates the generalized Huffman procedure in polynomial time, we devise and investigate several heuristics for pruning the merging steps. We first investigate pruning by some natural measures such as (our multi-channel generalization of) local redundancy, expected codeword length, and entropy. Unfortunately none of these simplistic strategies can guarantee the production of an optimal code.
One of our results is the following pruning strategy based on iterative executions of the (single-channel) Huffman procedure: For a multiset of probability masses, either given initially or produced by the previous iteration, we run the Huffman procedure on these masses with respect to the alphabet size of each channel. Let q be the alphabet size of the resulting Huffman code with the smallest expected length. We merge the q smallest probability masses into one and proceed to the next round.
Since all single-channel Huffman codes are included as potential choices in the above minimizing procedure, the resulting code is guaranteed to compress no worse than any of the optimal single-channel codes. While we are unable to show that this heuristic produces an optimal tree-decodable code, we have yet to find a counterexample either. We leave it as an open problem to prove the (sub)optimality of the resulting code. We remark that even if the heuristic produces an optimal tree-decodable code, it is unclear whether the code is also an optimal prefix code except for the 2-channel case. It is also unclear whether an optimal prefix code is an optimal uniquely decodable code.
I-C Paper Organization
We first introduce the background of multi-channel source coding in Section II. Then, we present multi-channel generalizations of some well-known results in single-channel source coding in Section III, and the multi-channel Huffman procedure for asymmetric-alphabet channels in Section IV. We show some evidence that an efficient construction of multi-channel Huffman codes is not trivial and propose a suboptimal code construction in Section V. Lastly, we conclude the paper and propose some future directions in Section VI.
II Background
Unless otherwise specified, we use the notation defined in this section throughout this paper. For a positive integer n, define [n] := {1, 2, ..., n}.
II-A Multi-Channel Source Coding
The definition of a multi-channel source code was proposed in [2] for symmetric-alphabet channels and extended in [3] to asymmetric-alphabet channels. Consider an n-channel system where the jth channel uses an alphabet Z_j of size q_j, j ∈ [n]. Without loss of generality, we assume q_1 ≤ q_2 ≤ ... ≤ q_n in this paper. Let S be the alphabet of the information source X. Denote by λ the empty string. Define Z_j^0 := {λ} and Z_j^k := {z_1 z_2 ... z_k : z_1, ..., z_k ∈ Z_j} for k ≥ 1. That is, Z_j^k contains all the strings of length k over the alphabet Z_j. The set of all the strings of finite length over the alphabet Z_j is denoted by Z_j^* := ∪_{k ≥ 0} Z_j^k. Every element of Z_1^* × Z_2^* × ... × Z_n^* is called a word.
Definition 1 (Multi-Channel Source Codes).
A (q_1, ..., q_n)-ary source code C for the information source X is a mapping from S to Z_1^* × Z_2^* × ... × Z_n^*.
For any source symbol x ∈ S, the ordered n-tuple C(x) = (c_1, c_2, ..., c_n) is the codeword for x. The jth component c_j of the codeword is transmitted through the jth channel. When we transmit more than one codeword, the codewords are concatenated channel-wise, i.e., the boundaries of the codewords are not explicit.
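For concreteness, the following minimal sketch (with a hypothetical {2, 3}-ary codebook that is not taken from the paper) illustrates codewords as tuples of per-channel strings and the channel-wise concatenation just described.

```python
# Hypothetical {2, 3}-ary code: each codeword is a pair (binary part, ternary part).
codebook = {
    'a': ('0', ''),    # a component may be the empty string
    'b': ('1', '0'),
    'c': ('1', '1'),
    'd': ('1', '2'),
}

def encode(symbols):
    """Concatenate codewords channel-wise; codeword boundaries are not explicit."""
    channel_1 = ''.join(codebook[s][0] for s in symbols)
    channel_2 = ''.join(codebook[s][1] for s in symbols)
    return channel_1, channel_2

print(encode('abca'))   # -> ('0110', '01')
```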
Definition 2 (Uniquely Decodable Codes).
If the finite sequences of codewords of any two distinct finite sequences of source symbols are different, then the source code is called a uniquely decodable code.
Definition 3 (Multi-Channel Prefix-Free Codes).
Two words are prefix-free (of each other) if there exists at least one channel j such that the jth components of the two words are prefix-free, i.e., neither is a prefix of the other. A (q_1, ..., q_n)-ary prefix-free code is a (q_1, ..., q_n)-ary source code in which every pair of codewords is prefix-free.
II-B Single-Channel Representation
As a channel does not interfere with the symbols in another channel, we can express a multi-channel word as a single-channel word over a dynamic alphabet. For example, let Z_1 = {0, 1} and Z_2 = {a, b, c}. The 2-channel word (01, a) can be expressed as 01a, 0a1, or a01, i.e., as any interleaving of its components that preserves the within-channel order. Note that the symbols of Z_1 and Z_2 have different letter costs (ln 2 and ln 3 nats respectively).
However, this single-channel representation is different from a single-channel code with unequal letter costs, a concept proposed in [4]. For example, in a single-channel code with unequal letter costs over the alphabet {0, 1, a, b, c}, the words a0 and 01a are prefix-free. However, the 2-channel words (0, a) and (01, a) are not prefix-free, although a0 and 01a are single-channel representations of them respectively.
When the alphabet sizes of the channels are the same, we can see that the single-channel representation of a multi-channel code can be reduced to a single-channel code with equal letter cost. That is why the compression ability cannot be enhanced in the setting proposed in [2].
II-C Kraft Inequality and Entropy Bound
Let n and M be the number of channels and the number of codewords respectively. Denote by H(X) the entropy of the source X measured in nats. All results can be trivially generalized to other bases of logarithm.
The length of the ith codeword in the jth channel is denoted by l_{i,j}. The length tuple of the ith codeword is (l_{i,1}, l_{i,2}, ..., l_{i,n}). The (description) length of the ith codeword, defined by L_i := Σ_{j=1}^{n} l_{i,j} ln q_j, is the number of nats required to represent it. Define the expected codeword length L := Σ_{i=1}^{M} p_i L_i, where p_i is the probability of the source symbol encoded by the ith codeword.
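As a small illustration of the description length just defined (a hypothetical helper, not code from the paper):

```python
# Description length (in nats) of a codeword with length tuple (l_1, ..., l_n)
# over channels with alphabet sizes (q_1, ..., q_n): sum_j l_j * ln(q_j).
import math

def description_length(length_tuple, alphabet_sizes):
    return sum(l * math.log(q) for l, q in zip(length_tuple, alphabet_sizes))

print(description_length((1, 1), (2, 3)))   # ln 2 + ln 3 = ln 6 ≈ 1.7918 nats
```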
The multi-channel analogs of the Kraft inequality and the entropy bound were proposed in [2] for symmetric-alphabet channels. They were extended to asymmetric-alphabet channels in the conference paper [3], but the proofs appear only in the preprint version [5]. For the sake of completeness, we reproduce the proofs in this paper.
Theorem 1 (Kraft Inequality [3]).
If a (q_1, ..., q_n)-ary source code is uniquely decodable, then the length tuples (l_{i,1}, ..., l_{i,n}) of its codewords satisfy

Σ_{i=1}^{M} Π_{j=1}^{n} q_j^(-l_{i,j}) ≤ 1, or equivalently Σ_{i=1}^{M} e^(-L_i) ≤ 1.   (1)
Proof:
We use a technique similar to the one in [6]. Let N be an arbitrary positive integer. Consider

( Σ_{i=1}^{M} Π_{j=1}^{n} q_j^(-l_{i,j}) )^N = Σ_{k_1, ..., k_n} A_{k_1,...,k_n} Π_{j=1}^{n} q_j^(-k_j),

where A_{k_1,...,k_n} is the coefficient of Π_{j=1}^{n} q_j^(-k_j) in the expansion of the left-hand side.
Note that A_{k_1,...,k_n} gives the total number of sequences of N codewords with a total length of k_j symbols in the jth channel for all j ∈ [n]. Since the code is uniquely decodable, these code sequences must be distinct. So, the number A_{k_1,...,k_n} must be no more than the total number of distinct sequences with k_j symbols in the jth channel for all j ∈ [n], that is, A_{k_1,...,k_n} ≤ Π_{j=1}^{n} q_j^(k_j). By substituting this inequality into the expansion above, and noting that each k_j ranges over {0, 1, ..., N l_j^max} where l_j^max := max_i l_{i,j}, we get

( Σ_{i=1}^{M} Π_{j=1}^{n} q_j^(-l_{i,j}) )^N ≤ Π_{j=1}^{n} (N l_j^max + 1).

Since this inequality holds for any N, we take the Nth root on both sides and let N → ∞ to obtain the desired Kraft inequality. ∎
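A hedged numerical check of (1), using the length tuples of the hypothetical {2, 3}-ary code from the earlier sketch:

```python
# Multi-channel Kraft sum: sum_i prod_j q_j^(-l_{i,j}), which equals sum_i exp(-L_i).
import math

alphabet_sizes = (2, 3)
length_tuples = [(1, 0), (1, 1), (1, 1), (1, 1)]   # codewords ('0',''), ('1','0'), ...

kraft_sum = sum(
    math.prod(q ** -l for l, q in zip(tup, alphabet_sizes))
    for tup in length_tuples
)
print(kraft_sum)   # 0.5 + 3 * (1/6) = 1.0, so inequality (1) holds with equality
```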
Theorem 2 (Entropy Bound [3]).
Let C be a uniquely decodable code for a source random variable X with probability masses p_1, p_2, ..., p_M and entropy H(X). Then,

L = Σ_{i=1}^{M} p_i L_i ≥ H(X),   (2)

where the equality holds if and only if p_i = e^(-L_i) for all i.
Proof:
By the Kraft inequality, Σ_{i=1}^{M} e^(-L_i) ≤ 1. Hence,

L - H(X) = Σ_{i=1}^{M} p_i (L_i + ln p_i) = Σ_{i=1}^{M} p_i ln(p_i / e^(-L_i)) ≥ Σ_{i=1}^{M} p_i (1 - e^(-L_i)/p_i) = 1 - Σ_{i=1}^{M} e^(-L_i) ≥ 0,

where the first inequality follows from ln x ≥ 1 - 1/x for x > 0 (with equality if and only if x = 1). Both inequalities hold with equality if and only if p_i = e^(-L_i) for all i. ∎

The single-channel Kraft inequality gives a necessary and sufficient condition for the existence of a single-channel uniquely decodable code. However, in the multi-channel analog, the sufficiency does not hold in general. The reason can be seen from the rectangle packing formulation for prefix codes in [3]: the Kraft inequality only states that the sum of the areas of the blocks (codewords) is no larger than the area of the container (all possible words). The geometry of the blocks (the packability of the blocks in the container) is not captured. On the other hand, the interpretation of the entropy bound is consistent with the traditional version: the expected codeword (description) length is no less than the entropy.
II-D Multi-Channel Tree-Decodable Codes
A single-channel prefix code can be represented by a decoding tree. Each branch of an internal node (including the root node) corresponds to a symbol in the alphabet. When we decode a codeword, we traverse the tree from the root node. We read the symbols in the codeword one by one. For each symbol, we traverse through the corresponding branch. We can decode a codeword once we reach a leaf of the tree.
When it comes to the multi-channel case, each internal node is assigned to a channel. An internal node belongs to class j if it is assigned to the jth channel, and it is only allowed to have at most q_j children. The generalized depth of a leaf is defined as the (description) length of the codeword that the leaf corresponds to. See Figs. 2(a) and 2(b) for examples of multi-channel decoding trees. A codeword can be decoded by traversing the tree. Suppose the current node being traversed is of class j; we read a symbol from the jth channel of the codeword and traverse the corresponding branch. Clearly, decoding is sequential, i.e., it does not refer to the symbols of any future codewords. A code which has a decoding tree is called a tree-decodable code. The decoding tree of an optimal tree-decodable code is called an optimal decoding tree. Note that a tree-decodable code can have more than one decoding tree.
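The following minimal sketch (assumed data structures, not the paper's) shows this sequential decoding on the hypothetical {2, 3}-ary code from the earlier sketches; channel indices are 0-based here.

```python
class Node:
    def __init__(self, channel=None, children=None, symbol=None):
        self.channel = channel      # class of an internal node (0-based channel index)
        self.children = children    # dict: channel symbol -> child Node
        self.symbol = symbol        # source symbol if this node is a leaf

def decode_one(root, streams, pos):
    """Decode one source symbol; pos[j] is the current read position in channel j."""
    node = root
    while node.symbol is None:
        j = node.channel
        node = node.children[streams[j][pos[j]]]
        pos[j] += 1
    return node.symbol

# Decoding tree of the code {a: ('0', ''), b: ('1', '0'), c: ('1', '1'), d: ('1', '2')}:
inner = Node(channel=1, children={'0': Node(symbol='b'),
                                  '1': Node(symbol='c'),
                                  '2': Node(symbol='d')})
root = Node(channel=0, children={'0': Node(symbol='a'), '1': inner})

streams, pos = ('0110', '01'), [0, 0]       # the channel-wise encoding of 'abca'
print(''.join(decode_one(root, streams, pos) for _ in range(4)))   # -> 'abca'
```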
A tree-decodable code is a prefix code [2]: for any two distinct leaves in a decoding tree, the codewords are not prefixes of each other on the channel to which their lowest common ancestor corresponds. The converse is not true in general when there are more than two channels. If, for every channel, there is a codeword of a prefix code having an empty string in that channel, then we cannot draw a decoding tree for the code. Below is an example in the spirit of the one given in [2].
Example 3.
Consider a binary 3-channel source code, i.e., q_1 = q_2 = q_3 = 2. The codewords (λ, 0, 0), (0, λ, 1), and (1, 1, λ) form a prefix code which is not tree-decodable, since every channel contains an empty component in some codeword.
The 2-channel setting is special in the sense that there cannot exist a codeword having an empty string in the first channel and another codeword having an empty string in the second channel, since these two codewords would not be prefix-free. With a more careful argument, we have the following theorem.
Theorem 3.
Any 2-channel prefix code is tree-decodable.
Proof:
We first argue that, for any 2-channel prefix code with at least two codewords, there exists at least one channel j such that for all codewords the jth component is not an empty string. We call such a channel always non-empty. Suppose not; then there exist codewords (λ, s) and (t, λ) for some strings s, t. Since λ is a prefix of both s and t, these two codewords are not prefix-free in either channel.
Next, given a 2-channel prefix code, we construct a decoding tree as follows. If the code has only one codeword, then we construct a tree with a single node associated with this codeword. Otherwise, suppose there are at least two codewords. By the argument above, there exists an always non-empty channel j. Let the root node be of class j. We partition the codewords, according to the first symbol of their jth components, into the chunks S_1, S_2, ..., S_{q_j}. For each chunk S_k, we remove the first symbol of the jth component of its codewords. This results in a sub-code S'_k.
We argue that S'_k is prefix-free for all k. Suppose not; let (s_1, s_2) and (t_1, t_2) be codewords in S'_k which are not prefix-free. Without loss of generality, suppose channel 1 was used for partitioning. Then, before the truncation, the codewords were (z s_1, s_2) and (z t_1, t_2) for some symbol z. Clearly, since (s_1, s_2) and (t_1, t_2) are not prefix-free, (z s_1, s_2) and (z t_1, t_2) are also not prefix-free. This contradicts the assumption that we start with a prefix code.
Since each S'_k is prefix-free, we can run the above procedure on S'_k recursively and link the resulting root node to the root node constructed above via the kth branch. This results in a decoding tree. ∎
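The recursive construction in the proof can be sketched as follows (a hypothetical helper operating on 2-channel codewords given as pairs of strings; it assumes the input is a prefix code, so that at every level of the recursion at least one channel is always non-empty):

```python
# Build a decoding tree (as nested dicts) from a 2-channel prefix code.
# Channel indices are 0-based; a leaf is represented by its source symbol.
def build_tree(codewords, symbols):
    if len(codewords) == 1:
        return symbols[0]                       # single codeword: a leaf
    # Pick an always non-empty channel; for a prefix code, if channel 0 is not
    # always non-empty, then channel 1 must be (first part of the proof).
    ch = 0 if all(cw[0] for cw in codewords) else 1
    chunks = {}                                 # first symbol in channel ch -> sub-code
    for cw, s in zip(codewords, symbols):
        head, rest = cw[ch][0], cw[ch][1:]
        new_cw = (rest, cw[1]) if ch == 0 else (cw[0], rest)
        chunks.setdefault(head, []).append((new_cw, s))
    return {'channel': ch,
            'children': {h: build_tree([c for c, _ in sub], [s for _, s in sub])
                         for h, sub in chunks.items()}}

code = {'a': ('0', ''), 'b': ('1', '0'), 'c': ('1', '1'), 'd': ('1', '2')}
print(build_tree(list(code.values()), list(code.keys())))
# {'channel': 0, 'children': {'0': 'a', '1': {'channel': 1, 'children': {...}}}}
```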
III General Results on Multi-Channel Source Codes
Towards a better general understanding of multi-channel source codes, we state a collection of elementary results regarding multi-channel source codes and tree-decodable codes which were not explicitly stated before. These results help in defining, analyzing, and approximating the multi-channel Huffman procedure in the remaining sections. They are not necessary for understanding the multi-channel Huffman procedure and can be skipped if the reader desires.
III-A Multi-Channel is No Worse than Single-Channel
Definition 4.
The trivial n-channel extension of a single-channel source code expands each codeword into an n-tuple in which all the new components are empty strings.
Example 4.
If the channel used by a single-channel source code is the 2nd channel of an n-channel system, then the trivial n-channel extension of the code substitutes every codeword c with (λ, c, λ, ..., λ).
We first state the following trivial but important result, obtained by viewing q_j-ary single-channel source codes as special cases of (q_1, ..., q_n)-ary source codes, which suggests the worthiness of studying multi-channel source codes.
Theorem 4.
The expected codeword length of an optimal (q_1, ..., q_n)-ary source code is no worse than that of an optimal q_j-ary single-channel source code for all j ∈ [n].
Proof:
The trivial multi-channel extension of an optimal single-channel source code is a multi-channel source code. The (description) length of every codeword is conserved by the extension, so the expected codeword length of an optimal multi-channel source code is no larger than that of the above trivial multi-channel extension. ∎
III-B Source Coding Theorem for Symbol Codes
Let L_q be the expected codeword length, measured in nats, of an optimal q-ary single-channel uniquely decodable code. One of the classical results by Shannon [7], known as the source coding theorem for symbol codes, is that H(X) ≤ L_q < H(X) + ln q. That is, we need no more than ln q nats beyond the entropy to represent an optimal code. The following theorem, a natural multi-channel analog, extends this idea: we need no more than ln q_1 nats (recall that q_1 ≤ q_2 ≤ ... ≤ q_n) beyond the entropy to represent an optimal code.
Theorem 5 (Source Coding Theorem for Symbol Codes).
Let L* be the expected codeword length of an optimal (q_1, ..., q_n)-ary uniquely decodable code. Then, H(X) ≤ L* < H(X) + ln q_1.
Proof:
The first inequality is the entropy bound. For the second inequality, we construct an optimal single-channel q_j-ary uniquely decodable code for the jth channel and let L_{q_j} be its expected codeword length in nats. We have L* ≤ L_{q_j} < H(X) + ln q_j, where the first inequality is by Theorem 4 and the second is the single-channel source coding theorem. As this holds for all j ∈ [n], in particular for j = 1, we have L* < H(X) + ln q_1. ∎
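As a quick numerical illustration of Theorem 5 (a minimal sketch using the source of Example 2):

```python
# Check H(X) <= L* < H(X) + ln(q_1) with q_1 = 2 for the source of Example 2.
import math

p = [1/2, 1/6, 1/6, 1/6]
H = -sum(x * math.log(x) for x in p)
L_star = 0.5 * math.log(2) + 0.5 * math.log(6)   # expected length of the code in Fig. 2(b)
print(H <= L_star < H + math.log(2))             # True; here L* = H(X) exactly
```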
It is also not hard to show that the above upper bound is tight, which is stated in the following theorem.
Theorem 6.
ln q_1 is the tightest upper bound on L* - H(X) that depends only on q_1, q_2, ..., q_n.
Proof:
Consider a multiset of q_1 probability masses {1 - (q_1 - 1)ε, ε, ..., ε}, where ε > 0 is small. Note that H(X) → 0 as ε → 0, and that the optimal n-channel code is the trivial n-channel extension of the optimal q_1-ary single-channel code, in which the length of every codeword is 1 q_1-it, i.e., ln q_1 nats. By taking ε → 0, we have L* - H(X) → ln q_1, so no bound smaller than ln q_1 which depends only on q_1, ..., q_n is valid. ∎
III-C Local Redundancy of Tree-Decodable Codes
The concept of local redundancy was introduced in [8] as a tool for understanding the redundancy of a tree-decodable code. An internal node is a non-leaf node in a decoding tree. In the decoding tree of a single-channel q-ary source code, the local redundancy of an internal node v is defined by r_v := P_v (ln q - H_v), where P_v and H_v are the reaching probability of node v and the entropy of its conditional branching probabilities respectively. An interpretation is that we use ln q nats to represent the branches of node v due to the restriction of the alphabet size, i.e., there are at most q branches in a q-ary source code. However, minimally it can be done with H_v nats. So, r_v is the expected number of nats wasted at node v.
In the decoding tree of a multi-channel tree-decodable code, an internal node corresponds to a channel. Let q_v be the alphabet size of the channel that node v corresponds to. The number of branches of node v is at most q_v. We use ln q_v nats to represent the branches at node v, but minimally it can be done with H_v nats. Therefore, we have the following straightforward extension of local redundancy.
Definition 5.
The local redundancy of an internal node v is r_v := P_v (ln q_v - H_v) nats.
Denote by V the index set of the internal nodes. Let L be the expected codeword length of a multi-channel tree-decodable code measured in nats. We can show that H(X) = Σ_{v ∈ V} P_v H_v by a conditional entropy argument (Lemma 1). On the other hand, we can show that L = Σ_{v ∈ V} P_v ln q_v by a weighted bookkeeping argument (Lemma 2). Then, by considering the difference L - H(X), we arrive at a multi-channel generalization of the local redundancy theorem (Theorem 7).
Lemma 1.
H(X) = Σ_{v ∈ V} P_v H_v.
Proof:
Let D be the height of the decoding tree. An outcome of X can be represented by a path from the root node to a leaf node of the decoding tree. Let N_d be the random variable for the node at depth d which leads to the outcome of X. If the outcome of X is reached at some depth, then the random variables for the larger depths are deterministic, i.e., they have zero conditional entropy. Then, H(X) = H(N_0, N_1, ..., N_D), and (N_0, N_1, ..., N_D) forms a Markov chain. As N_0 must be the root node, we have H(X) = Σ_{d=0}^{D-1} H(N_{d+1} | N_d). Let V_d be the index set of the internal nodes having depth d. We have H(N_{d+1} | N_d) = Σ_{v ∈ V_d} P_v H_v. Hence, H(X) = Σ_{v ∈ V} P_v H_v. ∎
Lemma 2.
L = Σ_{v ∈ V} P_v ln q_v.
Proof:
Each internal node v uses ln q_v nats to represent its branches, which means that the node adds ln q_v nats to the codeword length of each leaf under it. That is, node v increases the expected codeword length by P_v ln q_v nats. The proof is done by summing up all the increments made by the internal nodes. ∎
Theorem 7 (Local Redundancy Theorem).
The redundancy of a multi-channel tree-decodable code is L - H(X) = Σ_{v ∈ V} r_v = Σ_{v ∈ V} P_v (ln q_v - H_v) nats.
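As a numerical check of Theorem 7, consider a hypothetical {2, 3}-ary decoding tree (the masses and the tree below are illustrative and not taken from the paper): the root is a class-1 (binary) node whose children are a leaf of mass 0.5 and a class-2 (ternary) node with leaves of masses 0.2, 0.2, and 0.1.

```python
import math

def H(probs):
    """Entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

masses = [0.5, 0.2, 0.2, 0.1]
lengths = [math.log(2), math.log(6), math.log(6), math.log(6)]   # generalized depths
L = sum(p * l for p, l in zip(masses, lengths))                  # expected length
redundancy = L - H(masses)

# Local redundancies r_v = P_v * (ln q_v - H(conditional branching probabilities)).
r_root  = 1.0 * (math.log(2) - H([0.5, 0.5]))
r_inner = 0.5 * (math.log(3) - H([0.4, 0.4, 0.2]))               # 0.2/0.5, 0.2/0.5, 0.1/0.5
print(redundancy, r_root + r_inner)   # both ≈ 0.0218, as Theorem 7 predicts
```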
IV Multi-Channel Huffman Procedure
Let {p_1, p_2, ..., p_M} be the probability distribution of the source random variable X. Without loss of generality, assume that p_1 ≥ p_2 ≥ ... ≥ p_M. It is well known that the Huffman procedure [1] allows us to efficiently construct an optimal q-ary single-channel source code, called a q-ary Huffman code, for the information source [9]. Every iteration of the procedure selects the q smallest probability masses and merges them into one mass, i.e., the q smallest probability masses are removed from the multiset and then their sum is added back to the multiset. The codeword lengths for the selected masses are increased by one q-ary symbol, i.e., ln q nats. Every merge reduces the size of the multiset by q - 1. In order to merge all the probability masses into one single mass, we may need to inject d dummy symbols into the information source, i.e., inject d dummy masses with zero probability into the multiset, before we start the procedure. The value of d is the smallest non-negative integer such that q - 1 divides M + d - 1, i.e.,

d = (1 - M) mod (q - 1).
We call the leaves in the decoding tree which are assigned to dummy symbols the dummy leaves.
The core idea of the Huffman procedure is that the smaller the probability mass the longer the codeword length, so we can assign codewords of shorter lengths to those source symbols which are more likely to appear. This idea is also valid in the multi-channel case. A traditional proof, e.g., [10, Lem. 4.15], can be used to prove the following lemma.
Lemma 3.
In an optimal multi-channel uniquely decodable code, codewords with shorter lengths are assigned to larger probabilities.
Proof:
Suppose in an optimal code we have codeword lengths L_a > L_b for the probabilities p_a > p_b respectively. By exchanging the codewords assigned to these two probabilities, the expected codeword length of the code is changed by (p_a - p_b)(L_b - L_a) < 0, which contradicts the optimality of the code. ∎
When we adapt the Huffman procedure to a multi-channel source code, we have to decide how many probability masses we should merge, with the constraint that the number of probability masses to be selected is q_j for some j ∈ [n]. Note that the procedure can only produce tree-decodable codes.
Definition 6.
A merge sequence is a finite sequence whose ith term is the number of probability masses (including dummy masses) to be merged by the Huffman procedure in the ith iteration.
When we want to merge a certain number of masses in an iteration but there is more than one channel that can be used, i.e., they have the same alphabet size, we can arbitrarily choose a channel as the expected length is not affected and the code is still a prefix code.
The merge sequences for Figs. 2(a) and 2(b) are (2, 3) and (3, 2) respectively. We can see that it is not always better to merge a smaller (or a larger) number of masses in an iteration; a wrong decision may produce a code which is not optimal.
If a merge sequence (t_1, t_2, ..., t_m) is given, then the number of dummy symbols can be implicitly derived as d = 1 - M + Σ_{i=1}^{m} (t_i - 1), since each merge of t_i masses reduces the size of the multiset by t_i - 1 and the final multiset has size 1. When the merge sequence is undetermined, however, it is unclear exactly how many dummy symbols are needed. Nevertheless, we are able to derive an upper bound on the number of required dummy symbols. For some specific values of (q_1, ..., q_n), e.g., (q_1, q_2) = (2, 3), the upper bound forces d to be 0.
We now state two intermediate lemmas regarding the dummy leaves in an optimal decoding tree, followed by the theorem about the upper bound on the number of required dummy symbols.
Lemma 4.
There exists an optimal decoding tree where the siblings of a dummy leaf are also leaves.
Proof:
Consider an optimal decoding tree. If, for every dummy leaf, its siblings are also leaves, then we are done. Otherwise, there exists a dummy leaf such that one of its siblings is an internal node. Suppose there exists a non-dummy leaf in the sub-tree under this internal node. Then swapping the dummy leaf with this non-dummy leaf reduces the expected length of the corresponding code. Since we start with an optimal tree, this cannot happen. We can therefore assume that the leaves of the sub-tree under this internal node are all dummy leaves; then replacing the internal node with a dummy leaf also gives an optimal decoding tree. By repeating this procedure, we obtain an optimal decoding tree where the siblings of every dummy leaf are also leaves. ∎
Lemma 5.
There exists an optimal decoding tree where all the dummy leaves are siblings.
Proof:
Consider an optimal decoding tree. If all the dummy leaves are siblings of one another, then we are done. Otherwise, there exist two dummy leaves which are not siblings. By Lemma 4, we can assume that the siblings of each of these two dummy leaves are all leaves.
Consider one of these two dummy leaves. If all siblings of this dummy leaf are also dummy leaves, then we can construct another optimal decoding tree by replacing the parent of the dummy leaf by a dummy leaf. Therefore, without loss of generality, we can assume that some siblings of both dummy leaves are not dummy leaves.
We next argue that these two sets of siblings all have the same generalized depth. Suppose not, then swapping a non-dummy leaf of higher generalized depth with a dummy leaf of lower generalized depth reduces the expected length of the corresponding code. Since we start with an optimal tree, this cannot happen. We can therefore assume that the two sets of siblings all have the same generalized depth.
Since the two sets of siblings all have the same generalized depth, swapping leaves from the two sets do not affect the expected length of the corresponding code. We can therefore swap the leaves in such a way that we pack as many dummy leaves into one set as possible. There are two possible outcomes. Either one set is rid of dummy leaves, or one set is full of dummy leaves. For the latter case, we can use the argument above and replace the parent of these dummy leaves with a dummy leaf, and repeat the subsequent arguments again. For the former case, we can repeat the above argument until the set of all dummy leaves are siblings. ∎
Lemma 6.
Let s be the number of non-dummy probability masses to be merged in the first round. Then the parent node of the leaves associated with the merged masses is assigned to class k_s, where k_s is the smallest index in [n] such that q_{k_s} ≥ s. Furthermore, the number of dummy leaves required is q_{k_s} - s.
Proof:
By Lemmas 4 and 5, we know that there exists an optimal decoding tree where all dummy leaves are siblings, and all of their other siblings are non-dummy leaves. Let d' be the number of dummy leaves and s be the number of non-dummy leaves in this set of siblings. We argue that the parent of these siblings is assigned to class k_s, where k_s is the smallest index in [n] with q_{k_s} ≥ s, and that d' = q_{k_s} - s. Suppose not; then the parent is assigned to some class k with q_k > q_{k_s}, and d' = q_k - s. By changing the class of the parent node from k to k_s, and reducing the number of dummy leaves from q_k - s to q_{k_s} - s, we obtain a decoding tree whose corresponding code has a lower expected codeword length. As we start with an optimal decoding tree, this is impossible. We therefore conclude that the parent is of class k_s and d' = q_{k_s} - s. ∎
Theorem 8.
The number d of dummy symbols needed is bounded by d ≤ max_{j ∈ [n]} (q_j - q_{j-1} - 1), where q_0 := 1.
Proof:
Using the notation in Lemma 6, the number of dummy symbols is d = q_{k_s} - s. Note that s > q_{k_s - 1} (with q_0 := 1), for otherwise k_s - 1 is smaller than k_s and yet q_{k_s - 1} ≥ s, violating the definition of k_s. Note also that s ≥ 2, for otherwise we can replace the parent of the single node to be "merged" by the node itself. To summarize, we have d = q_{k_s} - s ≤ q_{k_s} - max{q_{k_s - 1} + 1, 2} ≤ max_{j ∈ [n]} (q_j - q_{j-1} - 1), where q_0 := 1. ∎
We now know the range of the number of dummy symbols we need to add, but not the exact number. On the other hand, we need to know how to merge these dummy symbols: should we merge all the dummy symbols in a single iteration, or should we merge a certain number of dummy symbols for a specific channel? The following lemma describes the structure of a specific optimal tree-decodable code. Note that there may be other optimal tree-decodable codes which do not have this structure. However, from the lemma, we know that we can merge all the dummy masses in the first iteration.
Lemma 7.
There exists an optimal decoding tree where all the dummy leaves and the codewords assigned to a certain number s of the smallest non-dummy probabilities are siblings.
Proof:
We continue with the optimal decoding tree stated in Lemma 5. Let k be the class to which the parent node of the codeword for p_M belongs. This codeword must have at least one sibling assigned to a non-zero probability, for otherwise we can remove the last symbol of the codeword in the kth channel to reduce the expected codeword length. Let s - 1 be the number of siblings of this codeword which are assigned to non-zero probabilities, where 1 ≤ s - 1 ≤ q_k - 1. Let one of these siblings be assigned to p_i where i < M. By Lemma 3 we have L_i ≤ L_{M-1} ≤ L_M, while L_i = L_M since siblings have the same generalized depth; hence L_i = L_{M-1}. If i ≠ M - 1, then we can swap the codewords assigned to p_i and p_{M-1} so that the codewords for p_M and p_{M-1} are siblings without changing the expected codeword length. We can repeat the above argument to show that the s smallest non-zero probabilities are siblings.
The final step is to show that if there is a dummy leaf, then it is a sibling of the codeword for p_M. It can be proved by a similar argument: if the dummy leaves are not siblings of the codeword for p_M, then we can swap a dummy leaf with a non-dummy sibling of the codeword for p_M so that the expected codeword length is either unchanged or reduced. ∎
Finally, we can use the above lemma to argue by induction that one of the merge sequences produces an optimal decoding tree. That is, the generalized Huffman procedure can indeed produce an optimal decoding tree, but we do not know the exact merge sequence in advance.
Theorem 9.
There exists a merge sequence which can produce an optimal multi-channel tree-decodable code.
Proof:
Fix s, the number of non-dummy probability masses to be merged in the first round. By Lemma 6, we can determine the class k_s, where k_s is the smallest index in [n] with q_{k_s} ≥ s, and the number q_{k_s} - s of dummy leaves. When we merge the s smallest probability masses (together with the dummy masses), we obtain a reduced multiset of probabilities. Let L_s be the expected codeword length of the optimal tree-decodable code for the reduced multiset of probabilities. Based on this reduced tree, we can reverse the merge to obtain a tree for the original multiset of probabilities whose expected codeword length is L_s + (p_{M-s+1} + ... + p_M) ln q_{k_s}. Note that since L_s is the expected codeword length of the optimal tree-decodable code for the reduced multiset of probabilities, this tree is optimal for the original multiset of probabilities among the trees whose first merge uses this specific s. Since the method of creating the trees conforms to the format in Lemma 7, there must exist an s which gives the overall optimal tree. By considering all possible s, the expected codeword length of the optimal tree-decodable code for the original multiset of probabilities is min_s [ L_s + (p_{M-s+1} + ... + p_M) ln q_{k_s} ]. We can apply the above argument inductively (but requiring s ≤ M' in subsequent rounds, where M' is the number of remaining masses) to all the possible reduced multisets of probabilities to conclude that there exists a merge sequence which can produce an optimal multi-channel tree-decodable code. ∎
The importance of the above result is that we do not have to try arbitrary merges on the probabilities: we only combine the smallest probabilities in each merge. This way, the search space is greatly reduced.
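To make this concrete, the following brute-force sketch (illustrative only, exponential in general, and limited to the {2, 3}-ary case where dummy masses are not needed) tries every merge sequence that combines the q smallest masses for q ∈ {2, 3} and keeps the smallest resulting expected length.

```python
import math
from functools import lru_cache

def best_tree_length(masses, alphabet_sizes=(2, 3)):
    @lru_cache(maxsize=None)
    def search(ms):                        # ms: sorted tuple of remaining masses
        if len(ms) == 1:
            return 0.0
        best = math.inf
        for q in alphabet_sizes:
            if q > len(ms):
                continue
            merged = sum(ms[:q])           # merge the q smallest masses
            rest = tuple(sorted(ms[q:] + (merged,)))
            # every codeword below the new internal node gains ln q nats
            best = min(best, merged * math.log(q) + search(rest))
        return best
    return search(tuple(sorted(masses)))

print(best_tree_length((1/2, 1/6, 1/6, 1/6)))   # ≈ 1.2425 nats, matching Fig. 2(b)
```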
V Approximate Multi-Channel Huffman Procedure
It is not known whether an optimal tree-decodable code is an optimal prefix code except for the 2-channel case. If we restrict ourselves to optimal tree-decodable codes, the multi-channel Huffman procedure can be applied. (A testing tree like the one discussed in [11] requires a source symbol to be assigned to specific branches of the internal nodes, and the cost of an internal node there is independent of the number of branches; that is, constructing an optimal tree-decodable code is different from constructing an optimal testing tree.) However, the procedure raises another issue: in each iteration we have to decide how many masses are merged. Trying all possible merge sequences produces an exponential number of decoding trees, which is inefficient. Take {2, 3}-ary codes as an example: the number of possible merge sequences for M probabilities is the Mth Fibonacci number.
One way to reduce the complexity is to prune the merge sequences according to some metric. Specifically, we start by trying all possible choices in each iteration and keeping track of the resulting reduced multiset sizes. Whenever we obtain two merge subsequences producing reduced multisets of the same size, we eliminate the under-performing one according to some metric and continue with the remaining one. This strategy avoids unfair comparisons between multisets of different sizes. For example, we should compare the reduced multisets generated by the merge subsequences (2, 2) and (3), as both of them reduce the number of masses by 2. In the following, we show that using this strategy with several natural metrics is sub-optimal.
Given a multiset of five probability masses, we apply the generalized Huffman procedure to generate a {2, 3}-ary tree-decodable code. Note that we are worse off choosing the 2nd (ternary) channel in the 1st iteration and merging a dummy mass with two non-dummy masses, since we can instead choose the 1st (binary) channel to merge the two non-dummy masses, which results in a shorter expected codeword length. Therefore, in the example here, we do not need to consider dummy masses. The candidate optimal merge sequences are (2, 2, 2, 2), (2, 2, 3), (2, 3, 2), (3, 2, 2), and (3, 3), while the winner is (3, 2, 2).
TABLE II: Redundancy (in nats) of the already constructed subtree(s), by number of remaining masses
merge sequence | 4 | 3 | 2 | 1
---|---|---|---|---
(2,2,2,2) | 0.0072895611 | 0.0073186993 | 0.0139724304 | 0.024088589
(2,2,3) | 0.0072895611 | 0.0073186993 | | 0.0337666569
(2,3,2) | 0.0072895611 | | 0.0084312615 | 0.0681102540
(3,2,2) | | 0.0113528472 | 0.0120340121 | 0.0153997900
(3,3) | | 0.0113528472 | | 0.1027103422
TABLE III: Expected codeword length (in nats) of the already constructed subtree(s), by number of remaining masses
merge sequence | 4 | 3 | 2 | 1
---|---|---|---|---
(2,2,2,2) | 0.2280454224 | 0.5254055629 | 0.9211926030 | 1.6143397835
(2,2,3) | 0.2280454224 | 0.5254055629 | | 1.6240178515
(2,3,2) | 0.2280454224 | | 0.9652142681 | 1.6583614487
(3,2,2) | | 0.5943492482 | 0.9125038040 | 1.6056509846
(3,3) | | 0.5943492482 | | 1.6929615368
TABLE IV: Entropy (in nats) of the reduced multiset of probabilities, by number of remaining masses
merge sequence | 4 | 3 | 2 | 1
---|---|---|---|---
(2,2,2,2) | 1.3694953333 | 1.0721643310 | 0.6830310220 | 0.0000000000
(2,2,3) | 1.3694953333 | 1.0721643310 | | 0.0000000000
(2,3,2) | 1.3694953333 | | 0.6334681881 | 0.0000000000
(3,2,2) | | 1.0072547937 | 0.6897814027 | 0.0000000000
(3,3) | | 1.0072547937 | | 0.0000000000
TABLE V: Entry-wise sum of Tables III and IV, by number of remaining masses
merge sequence | 4 | 3 | 2 | 1
---|---|---|---|---
(2,2,2,2) | 1.5975407557 | 1.5975698939 | 1.6042236250 | 1.6143397835
(2,2,3) | 1.5975407557 | 1.5975698939 | | 1.6240178515
(2,3,2) | 1.5975407557 | | 1.5986824562 | 1.6583614487
(3,2,2) | | 1.6016040418 | 1.6022852068 | 1.6056509846
(3,3) | | 1.6016040418 | | 1.6929615368
During the Huffman procedure, we record the redundancy (sum of local redundancies) and the expected codeword length of the already constructed subtree(s) in Tables II and III respectively. We also record the entropy of the reduced multisets of probabilities in Table IV. Table V shows the entry-wise sum of Tables III and IV. The values in Tables IV and V are bounds (by Theorems 2 and 5) on the expected codeword length of the not-yet-constructed part and on the expected codeword length of the resultant code respectively. All the values are measured in nats.
For each table, we compare the steps of different merge sequences whenever the numbers of remaining masses are the same. If a merge sequence does not attain the column minimum during a comparison, we prune that merge sequence away. The merge sequence which survives to the rightmost column is the output of the procedure for the specific metric of the table. We can see that none of these tables produces the optimal merge sequence (3, 2, 2).
We now propose a straightforward suboptimal code construction which can guarantee a redundancy no larger than that of a single-channel Huffman code. We follow an idea similar to the one used in Table V except that, instead of adding the entropy of the reduced multiset of probabilities to the expected codeword length of the already constructed subtree(s), we add the smallest expected codeword length among the single-channel Huffman codes constructed for the reduced multiset of probabilities on the different channels. This sum is the expected codeword length of a code which can be constructed explicitly. Note that a single-channel Huffman code corresponds to one of the merge sequences, so we can ensure that the code produced this way has a redundancy no larger than that of a single-channel Huffman code. The corresponding values for the running example are shown in the table below; with this metric, the surviving merge sequence is (3, 2, 2), which is the optimal one here.
TABLE VI: Expected codeword length of the already constructed subtree(s) plus the smallest single-channel Huffman expected codeword length for the reduced multiset (in nats), by number of remaining masses
merge sequence | 4 | 3 | 2 | 1
---|---|---|---|---
(2,2,2,2) | 1.6143397835 | 1.6143397835 | 1.6143397835 | 1.6143397835
(2,2,3) | 1.6143397835 | 1.6143397835 | | 1.6240178515
(2,3,2) | 1.6143397835 | | 1.6583614487 | 1.6583614487
(3,2,2) | | 1.6056509846 | 1.6056509846 | 1.6056509846
(3,3) | | 1.6056509846 | | 1.6929615368
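A sketch of this construction (with assumed implementation details such as tie-breaking, and restricted to alphabet sizes 2 and 3 so that dummy masses need not be handled): it branches on merging the 2 or the 3 smallest masses and, whenever two partial merges reach reduced multisets of the same size, keeps only the one with the smaller value of the metric described above.

```python
import heapq, math

def huffman_nats(masses, q):
    """Expected length (in nats) of a q-ary single-channel Huffman code."""
    heap = list(masses)
    while (len(heap) - 1) % (q - 1) != 0:
        heap.append(0.0)                 # dummy zero-probability masses
    heapq.heapify(heap)
    acc = 0.0
    while len(heap) > 1:
        m = sum(heapq.heappop(heap) for _ in range(q))
        acc += m
        heapq.heappush(heap, m)
    return acc * math.log(q)

def prune_construction(masses, alphabet_sizes=(2, 3)):
    M = len(masses)
    # best[size] = (metric, length of constructed part, reduced multiset)
    best = {M: (min(huffman_nats(masses, q) for q in alphabet_sizes), 0.0, sorted(masses))}
    for size in range(M, 1, -1):
        if size not in best:
            continue
        _, acc, ms = best[size]
        for q in alphabet_sizes:
            if q > size:
                continue
            merged = sum(ms[:q])                         # merge the q smallest masses
            rest = sorted(ms[q:] + [merged])
            new_acc = acc + merged * math.log(q)
            bound = min(huffman_nats(rest, r) for r in alphabet_sizes)
            cand = (new_acc + bound, new_acc, rest)
            if len(rest) not in best or cand[0] < best[len(rest)][0]:
                best[len(rest)] = cand                   # prune the worse subsequence
    return best[1][1]                                    # expected length of the code

print(prune_construction([1/2, 1/6, 1/6, 1/6]))   # ≈ 1.2425 nats for this source
```

For the source of Example 2 this reproduces the optimal {2, 3}-ary expected length. The metric of a partial merge is always the expected length of a code that can be completed explicitly (the constructed part plus a single-channel Huffman completion), which is what underpins the redundancy guarantee discussed above.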
VI Conclusion and Future Directions
In this paper, we showed that it is possible for a multi-channel source code to achieve better compression than an optimal single-channel source code. We also presented the multi-channel analogs of some classical results in single-channel source coding. Some future research directions on multi-channel source coding are:
1. Is an optimal multi-channel tree-decodable code an optimal prefix (or uniquely decodable) code?
2. Is there a metric with which the pruning strategy results in an optimal multi-channel Huffman code?
3. How to extend adaptive, canonical, and run-length Huffman codes to their multi-channel versions?
4. How to restrict the expected codeword length for each channel and balance the usage of each channel?
References
- [1] D. A. Huffman, “A method for the construction of minimum-redundancy codes,” Proceedings of the IRE, vol. 40, no. 9, pp. 1098–1101, 1952.
- [2] H. Yao and R. W. Yeung, “Zero-error multichannel source coding,” in 2010 IEEE Information Theory Workshop on Information Theory (ITW), Jan. 2010, pp. 1–5.
- [3] H. H. F. Yin, K. H. Ng, Y. T. Shing, R. W. F. Lai, and X. Wang, “Decision procedure for the existence of two-channel prefix-free codes,” in 2019 IEEE International Symposium on Information Theory (ISIT), Jul. 2019, pp. 1522–1526.
- [4] M. J. Golin, C. Kenyon, and N. E. Young, “Huffman coding with unequal letter costs,” in Proceedings of the Thiry-Fourth Annual ACM Symposium on Theory of Computing (STOC), May 2002, pp. 785–791.
- [5] H. H. F. Yin, K. H. Ng, Y. T. Shing, R. W. F. Lai, and X. Wang, “Decision procedure for the existence of two-channel prefix-free codes,” arXiv preprint arXiv:1904.12112, Apr. 2019.
- [6] J. Karush, “A simple proof of an inequality of McMillan,” IRE Transactions on Information Theory, vol. 7, no. 2, pp. 118–118, 1961.
- [7] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
- [8] R. W. Yeung, “Local redundancy and progressive bounds on the redundancy of a Huffman code,” IEEE Transactions on Information Theory, vol. 37, no. 3, pp. 687–691, May 1991.
- [9] J. van Leeuwen, “On the construction of Huffman trees,” in The Third International Colloquium on Automata, Languages and Programming (ICALP), Jul. 1976, pp. 382–410.
- [10] R. W. Yeung, Information Theory and Network Coding. Springer, 2008.
- [11] R. W. Yeung, “On noiseless diagnosis,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 24, no. 7, pp. 1074–1082, Jul. 1994.