Hypothesis Testing of Mixture Distributions using Compressed Data
Abstract
In this paper we revisit the binary hypothesis testing problem with one-sided compression. Specifically we assume that the distribution in the null hypothesis is a mixture distribution of iid components. The distribution under the alternative hypothesis is a mixture of products of either iid distributions or finite order Markov distributions with stationary transition kernels. The problem is studied under the Neyman-Pearson framework in which our main interest is the maximum error exponent of the second type of error. We derive the optimal achievable error exponent and, under a further sufficient condition, establish the maximum ε-achievable error exponent. It is shown that to obtain the latter, the study of the exponentially strong converse is needed. Using a simple code transfer argument we also establish new results for the Wyner-Ahlswede-Körner problem in which the source distribution is a mixture of iid components.
Index Terms:
Mixture distribution, information-spectrum method, exponentially strong converse, Neyman-Pearson framework, error exponent.
I Introduction
I-A Motivations & Related Works
Hypothesis testing with communication constraints is a classic problem in information theory. The problem was initiated by Ahlswede and Csiszár in [1] as well as by Berger in [2]. It is assumed that a pair of data sequences is observed at separate locations. The sequence is compressed and sent to the location of via a noiseless channel. The decision center there decides whether is iid generated from the null hypothesis with a distribution or from the alternative hypothesis with a distribution . The Neyman-Pearson framework was used to study the trade-off between the probabilities of errors. The main interest was to establish the maximum error exponent of the second type of error when the probability of the first type of error is bounded as the number of samples tends to infinity. A single letter formulation was given in [1] for the testing against independence scenario and a strong converse was proven for the general setting under a special condition. Various lower bounds for the general setting have been proposed in [3] and [4]. The work [3] also established the optimal error exponent in the zero-compression rate regime when ε is small enough. Using the same condition as in [1], the work [5] established the optimal ε-error exponent also in the zero-compression rate regime. The optimality of the random binning scheme proposed in [4] was shown for the conditional independence setting in [6]. In [7] the authors extended the testing against independence study to the successive refinement setting. The work [8] studied the non-asymptotic regime of the general setting under the two-sided zero-compression rate constraint. Yet, in all of the above studies, the distributions in both hypotheses are assumed to be iid.
Mixture distributions are prevalent in practice, cf. [9] for a comprehensive list of applications. However, hypothesis testing for mixture distributions is an under-examined direction in information theory. Notable studies are given in [10, 11, 12]. Communication constraints are not included in the above works. In this work we study the following binary hypothesis testing problem in which the hypotheses are given by
(1) |
A mixture of iid components is also known as a mixture of repeated measurements or a mixture of grouped observations in the literature. Without being exhaustive, we list a few works with applications in topic modeling [13], in cognitive psychology [14], and in developmental psychology [15], cf. also [16]. In machine learning applications, to allow flexible modeling it is often assumed that within each component the joint distribution has a product form without the requirement of having the same marginal distribution, cf. [17] and [18]. We keep the iid condition in the null hypothesis for tractable analysis. A motivating example is provided in the following.
Assume that two statisticians observe two bags of words and taken as excerpts from some documents where denote the number of words in each bag. For simplicity we assume that in each bag words have an identical marginal distribution and also the order of words is not important. Then it is likely that the bag of words () is not generated iid, since for example knowing the first word () to be in Latin (German) gives us a guess that the whole () is in Latin (German) [19]. Since the order of words does not matter, and are sequences of exchangeable observations. It is natural to approximate the distributions of and by finite mixtures of iid distributions, due to de Finetti’s theorem [20], in which each underlying state represents a topic. The two statisticians then form a hypothesis testing problem. In the null hypothesis, they assume that the two bags of words are generated jointly iid according to from an unknown topic with probability , for example is a direct translation or a synonym of for all . In the alternative hypothesis they assume that the two bags of words are generated independently and iid according to from an unknown topic with probability .
We are similarly interested in the optimal error exponent of the second type of error when the probability of the first type is restricted by some ε. The iid assumption of each component in the null hypothesis as well as the factorization assumption in the alternative hypothesis are used to facilitate the derivation. In contrast to the classic strong converse result in [1], the maximum ε-achievable error exponent generally depends on not only the prior distribution but also relations between different constituent components in the mixture distributions. When no compression is involved, the weak law of large numbers can be used to establish the maximum ε-achievable error exponent, cf. [11, 12]. However, this argument is no longer applicable in the presence of compression. We need to use an argument established through proving exponentially strong converses of constituent components.
I-B Contributions
We summarize the contents of our work in the following.
-
•
We provide the optimal achievable error exponent in Theorem 3 and show in Corollary 1 that for small enough ε the obtained error exponent is also ε-optimal. Our proof is based on studying a compound hypothesis testing problem of differentiating between and under communication constraints. The results are established when and are stationary memoryless processes or finite Markov processes with stationary transition probabilities.
-
•
Under a further sufficient condition on the sets of distributions and , Assumption 2, we provide a complete characterization of the maximum ε-error exponent in Theorem 4. It is shown that even if the strong converse is available for the compound problem of testing against , it is not sufficient for establishing the maximum ε-error exponent in our mixture setting. Our derivation is based on the exponentially strong converse of the false alarm probabilities under the assumed condition.
-
•
We refine a recently established connection in [21] between the Wyner-Ahlswede-Körner (WAK) problem [22, 23] and the hypothesis testing against independence problem. While the previous result holds only under the stationary ergodic assumption, our new connection in Theorem 5 is valid under a more general assumption. We then use the refined connection to establish the corresponding minimum achievable compression rate and the minimum ε-achievable compression rate for the WAK problem with mixture distributions.
I-C Organization
Our paper is organized as follows. In Section II we review previous results and define quantities which are essential to characterize optimal error exponents in our study. We then establish various results for the compound setting, some of which might be of independent interest, in Section III. In Section IV, we provide the maximum achievable error exponent in the mixture setting. We state the sufficient condition and use it to establish the maximum ε-achievable error exponent in the mixture setting in Section V. Then the refined connection between the WAK problem and the hypothesis testing against independence problem is given in Section VI. Consequences of the new connection are also given therein.
I-D Notations
We focus on finite alphabets in this paper. Before we begin we make the following conventions. Given a probability measure , denotes its -fold product measure extension. denotes the natural logarithm. For any two distributions and on an alphabet , assume that ; if is absolutely continuous w.r.t. , denoted by , then we define , otherwise we define , irrespective of whether is equal to 0 or not. We also define when (for simplicity we use the convention in the following). The relative entropy between two distributions and is defined as where . For a joint distribution on , if or holds, then we define , as the expression is valid. Otherwise we define . The mutual information between and that is jointly distributed according to is defined as . For a finite set , we use and to denote its cardinality and its complement. For a mapping we define to be the cardinality of its range, . For a given distribution on and a stochastic mapping with the corresponding transition kernel we define
(2) |
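The divergence and mutual information conventions above can be made concrete with a small numerical sketch. The following Python snippet is illustrative only; the function names and the toy distribution are ours, and distributions on finite alphabets are represented as probability vectors.

```python
import numpy as np

def kl_divergence(p, q):
    """D(p||q) in nats, following the conventions above:
    terms with p(a) = 0 contribute 0, and D(p||q) = +infinity
    whenever p is not absolutely continuous w.r.t. q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.any((q == 0) & (p > 0)):   # absolute continuity violated
        return np.inf
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def mutual_information(p_xy):
    """I(X;Y) = D(P_XY || P_X x P_Y) for a joint pmf given as a 2-D array."""
    p_xy = np.asarray(p_xy, float)
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    return kl_divergence(p_xy.ravel(), (p_x * p_y).ravel())

# sanity check: a product distribution has zero mutual information
assert abs(mutual_information(np.outer([0.3, 0.7], [0.5, 0.5]))) < 1e-12
```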
II Preliminaries
We review previous results on the hypothesis testing problem with one-sided compression. Then we present an important assumption and essential quantities that are needed to establish results of our study in later sections.

In [1], the authors studied the following binary hypothesis testing problem: deciding whether is iid generated from or , using a testing scheme . Herein is a compression mapping,
(3) |
and is a decision mapping
(4) |
The setting is depicted in Fig. 1. The corresponding type I and type II error (also known as false alarm and miss detection) probabilities are given by
(5) |
where , , is the probability that outputs given the pair . The following achievability definition is repeatedly used in the subsequent analysis.
Definition 1.
For a given and an ε, is an ε-achievable error exponent of the second type for the binary hypothesis testing problem if there exists a sequence of testing schemes such that
(6) |
We define as the supremum of all ε-achievable error exponents at .
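To make Definition 1 concrete, the following sketch simulates the uncompressed special case, where by the Chernoff-Stein lemma the best type-II exponent is the relative entropy between the two hypotheses. The distributions, the slack delta, and all parameters below are toy choices of ours, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([0.5, 0.5])   # null hypothesis (toy choice)
Q = np.array([0.8, 0.2])   # alternative hypothesis (toy choice)
D = float(np.sum(P * np.log(P / Q)))   # D(P||Q), the Stein exponent

def accept_null(x, delta=0.05):
    # accept H0 iff the normalized log-likelihood ratio exceeds D - delta
    return np.log(P[x] / Q[x]).mean() >= D - delta

trials = 50_000
for n in (10, 20, 40):
    alpha = np.mean([not accept_null(rng.choice(2, n, p=P)) for _ in range(trials)])
    beta = max(np.mean([accept_null(rng.choice(2, n, p=Q)) for _ in range(trials)]), 1 / trials)
    # alpha -> 0 by the weak law of large numbers, while beta decays
    # exponentially with an exponent approaching D(P||Q) as delta -> 0
    print(f"n={n}: alpha={alpha:.3f}, (1/n)log(1/beta)={-np.log(beta)/n:.3f}, D={D:.3f}")
```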
When for all holds, Ahlswede and Csiszár proved the strong converse result that for a given , does not depend on . They also provided a multi-letter formula for under the stated condition. In the case that holds, the formula reduces to (for notation simplicity, we suppress the cardinality bound for in the sequel)
(7) |
A closer examination also reveals that in the case holds, where and are distributions on and satisfying and , we also have for all (in [5], the authors claimed in the introduction that Ahlswede and Csiszár obtained a single-letter characterization when using the entropy characterization methods; however, in [1] only characterizations for and an arbitrary with were provided, and further details validating this claim were not given in [5])
(8) |
In other words, Theorem 5 in [1] is tight in this case. We give a general relation using code transformations which leads to this observation in Appendix A.
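For testing against independence, the single-letter characterization of [1] is the maximum of I(U;Y) over test channels P(U|X) with I(U;X) ≤ R. The brute-force sketch below evaluates this quantity on a toy binary example with a binary auxiliary variable; the grid search and all names are our illustrative choices.

```python
import numpy as np

def mi(p_joint):
    """Mutual information (nats) of a 2-D joint pmf."""
    px = p_joint.sum(1, keepdims=True)
    py = p_joint.sum(0, keepdims=True)
    m = p_joint > 0
    return float(np.sum(p_joint[m] * np.log(p_joint[m] / (px * py)[m])))

def theta(R, p_xy, grid=60):
    """Brute-force max over binary test channels P(U|X) of I(U;Y) s.t. I(U;X) <= R."""
    px = p_xy.sum(axis=1)
    best = 0.0
    for a in np.linspace(0, 1, grid):       # P(U=0 | X=0)
        for b in np.linspace(0, 1, grid):   # P(U=0 | X=1)
            p_u_given_x = np.array([[a, 1 - a], [b, 1 - b]])
            if mi(px[:, None] * p_u_given_x) <= R:   # constraint I(U;X) <= R
                p_uy = p_u_given_x.T @ p_xy          # joint of (U,Y) via U - X - Y
                best = max(best, mi(p_uy))
    return best

p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])   # toy joint source
for R in (0.05, 0.2, np.log(2)):
    print(f"R={R:.3f}: theta(R) ~ {theta(R, p_xy):.3f}, I(X;Y)={mi(p_xy):.3f}")
```

At R = log 2 the choice U = X is feasible, and the maximum reaches I(X;Y), as the last line of output confirms.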
This observation motivates us to study in this paper generalized problems of compound hypothesis testing and hypothesis testing for mixture distributions. These settings are based on finite collections of distributions and . Let
(9) |
We assume that can be enumerated as . For each , let be the set of indices such that is the marginal distribution of on , i.e.,
(10) |
To characterize the optimal error exponents in the compound setting and the mixture setting, we need to make the following assumption.
Assumption 1.
The processes and are assumed to be in one of the following categories.
-
•
and hold for all and , respectively. We assume further that for all , and for all , hold. Accordingly, we define the minimum distances as (in case the minimizer is not unique, we pick the first one)
(11)
-
•
and are Markov processes of finite orders with stationary transition probabilities for all and . Compared to the first category we need an additional assumption: for all , we have for all , and similarly for all . We define the following limits, called relative entropy rates,
(12)
The existence of these relative entropy rate limits is guaranteed by [24, Theorem 1]. Similarly, we assume that for all , and for all , hold. We further define
(13)
For notation simplicity we also define total minimum distances for all , ,
(14) |
The following quantities are used to characterize optimal error exponents in later sections (for clarity, we reserve pairs of random variables for the distributions and pairs of random variables for the distributions )
(15) |
where and for all , hold.
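In the iid category, the minimum distances of Assumption 1 are minima of pairwise relative entropies between component distributions. A small sketch of how one might tabulate them (the component distributions here are toy stand-ins):

```python
import numpy as np

def kl(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.any((q == 0) & (p > 0)):
        return np.inf
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

# toy component distributions for the null and alternative hypotheses
P_list = [np.array([0.5, 0.5]), np.array([0.7, 0.3])]
Q_list = [np.array([0.9, 0.1]), np.array([0.2, 0.8])]

# pairwise divergence matrix; Assumption 1 requires the row minima
# to be positive, and ties are broken by taking the first minimizer
Dmat = np.array([[kl(p, q) for q in Q_list] for p in P_list])
print("divergence matrix:\n", Dmat)
print("minimum distances:", Dmat.min(axis=1), "minimizers:", Dmat.argmin(axis=1))
```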
III Compound Hypothesis Testing
In this section we study the compound hypothesis testing problem of differentiating between two collections of distributions and satisfying Assumption 1. We derive the optimal achievable error exponent and provide partial strong converse results for this setting.
A testing scheme for this problem is similarly characterized by a pair of compression-decision mappings . For a given testing scheme , we define the following quantities which characterize the maximum type I and type II error probabilities
(16) |
Similarly we also use Definition 1 as the ε-achievability definition. For a given , the maximum ε-achievable error exponent is denoted by .
As our achievability result will be used in Sections IV and V, for notation compactness, we define the following auxiliary sets. For a given mapping and a positive number , define for each , an intersected decision set as follows
(17) |
In the case and hold, is a decision region based on likelihood ratio for testing against .
III-A Characterization of
The following result characterizes the optimal achievable error exponent in the compound hypothesis testing problem.
Theorem 1.
For a given compression threshold , we have . Furthermore, for a given positive number and any sequence of testing schemes such that is achievable we also have with
(18) |
The achievability proof of Theorem 1 is given in Appendix B. The converse proof of Theorem 1 follows from the one of Theorem 3. The proof of Theorem 1 uses a combination of techniques from [11], [25] and our new mixing idea. In the following we provide an overview of steps in the proof of Theorem 1.
Assume that the set of marginal distributions on , , consists of a single element. Assume further that the number of components in the null hypothesis is two, i.e. and . First we check whether the sequence is “typical” in the sense that
(19) |
This helps us to perform the change of measure step from to in the analysis of the type-II (or miss detection) probability. The above condition is violated with vanishing probability in the analysis of the false alarm probability. We then select a test channel and generate a codebook from the marginal distribution . In our proof we do not estimate from the sequence , to avoid potential complications in the analysis of the miss detection probability. Instead we artificially create the following joint distribution
(20) |
where are positive probability weights. shifts the burden from calculating the miss detection probabilities to bounding the false alarm probabilities, which is less complex. We then consider the following score function which is helpful in defining a deterministic decision mapping
(21) |
In the score function the first term resolves the uncertainty within the set of marginal distributions , while the second term resolves the mismatch between two sets of distributions and . The second term also indirectly checks whether is “typical”.
Given a chosen codeword, which we explain how to obtain later, we decide that the null hypothesis is true if is fulfilled. Given this decision the miss detection probabilities can be deduced based on the following chain of measure changing steps
(22) |
as well as the fact that holds, where the summation is performed over the decision region.
Now to obtain a transmission message index, we search for a codeword that yields the lowest maximum conditional false alarm probability, by looking at
(23) |
In the analysis of the maximum false alarm probabilities, changing measure from to in the expression of is relatively standard, cf. [11]. We can then use standard typicality arguments to conclude the existence of a good codebook. In the general case where the marginal set has multiple elements, we need to estimate . Because of the way that we design , this extra step does not affect the exponent of the miss detection probability.
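The reason the auxiliary mixture makes the false alarm analysis sufficient can be summarized by the following elementary change of measure bound, written here in generic notation (the weights w_i, components P_i, and event A are placeholders of ours):

```latex
% For any event A and any mixture \tilde{P} = \sum_i w_i P_i with w_i > 0:
\tilde{P}(A) = \sum_{i} w_i\, P_i(A) \;\ge\; w_i\, P_i(A)
\quad\Longrightarrow\quad
P_i(A) \;\le\; \frac{1}{w_i}\,\tilde{P}(A) \quad\text{for every } i.
```

Hence any exponential bound on an error probability under the mixture transfers to every component at no cost in the exponent, since the weights do not depend on the sample size.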
III-B A partial characterization of
We have the following result which provides a partial characterization of .
Theorem 2.
Given a positive number , define the inactive set .
-
•
If holds (we use the following convention: if then and ), then we have
-
•
If , then we have
-
•
Let be an optimality achieving index, i.e., . Assume that is active, i.e., . For an arbitrarily given , for any sequence of testing schemes such that the following inequalities are satisfied
(24)
we then have , where is an index such that . This also implies that under this assumption we have for all .
A simple case in which the third statement in Theorem 2 holds is when for all . The proof of Theorem 2 is given in Appendix C. The second part of Theorem 2 will be employed in proving the converse of Theorem 4 for mixture models in Section V.
Recall that represents the set of distributions which have the same marginal distribution on , cf. Equation (10). In the following we provide an outline of the proofs of the first two points in the statement Theorem 2.
We discuss in this paragraph the first item in the first part of Theorem 2. When is active, i.e., holds, then it follows from the strong converse bound for testing against for each inside the class , that holds for all , cf. Theorem 7. Therefore we only need to focus on the inactive set . For simplicity, in this discussion we can assume that holds and there are two components inside . We assume further that there is no mismatch, i.e., and hold. The formulation of requires the selection of a test channel . To show the strong converse bound, the general idea is hence to identify a common test channel . This can be done by considering relevant sets , , of such that for each
holds, where and are arbitrary. By setting and using the reverse Markov inequality, we obtain the following inequalities
(25) |
We require that the intersection should be non-empty. This allows us to define a joint distribution where is supported on . Note also that -restriction in the definition of allows us to obtain the following inequality
With this we can identify for all . The variational arguments in [26] can indeed be used to show that also holds. To make non-empty, we must have . When the inactive set contains more than one element we obtain the corresponding threshold .
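For reference, the reverse Markov inequality invoked above can be stated as follows (in generic notation):

```latex
% If X \le b almost surely and a < \mathbb{E}[X], then
\Pr\{X > a\} \;\ge\; \frac{\mathbb{E}[X] - a}{b - a},
% which follows by rearranging
% \mathbb{E}[X] \le a\,\Pr\{X \le a\} + b\,\Pr\{X > a\}.
```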
Now we discuss the second item in the first part of Theorem 2. Since holds for all and , cf. Theorem 7, we only explain the achievability direction of the second item. For simplicity we assume that the set of marginal distributions has a single element, the element is inactive , and . Our achievability idea is to build two sequences of testing schemes separately and then mix them together. For this we need to divide the space into partitions and such that for all for all sufficiently large . Since , such a partitioning can be done. For each , we design a sequence of testing schemes to differentiate between and such that is achievable. As in the proof of Theorem 1 we also define an auxiliary mixture distribution
(26) |
The mixture distribution helps to alleviate the estimation of the distribution of . Once the preparation is complete, we perform the compression as follows.
We first check whether is a typical sequence, in the sense that the following condition is fulfilled:
(27) |
Again the above inequality also helps to resolve the mismatch between the two sets of distributions and . Suppose that is typical. If then we use to compress it, and similarly when we use to compress it. The joint compression mapping is then
(28) |
The joint compression mapping induces the following distribution from
(29) |
We also define the following score function
(30) |
We say is true if holds. Let us look at which can be upper-bounded as
(31) |
where holds since when is in and is typical, we use to compress it. We have by construction. The first term can be shown to be vanishing using steps similar to those in the proof of Theorem 1.
We note that Theorem 1 only guarantees that is achievable, which is below , our desired error exponent in this part of Theorem 2. The collection of sets is used to resolve the confusion about which compression mappings we should use when is not active. We are willing to pay an additional error probability price for using this collection. The general case is a little more complicated but follows the same principles as discussed herein. Note that when is active, we do not need to divide into a collection of subsets as above. This is because by Theorem 1 we can design a sequence of testing schemes to differentiate between and such that is achievable.
Remark 1.
We have . The inequality can be strict, which can be shown numerically. This means that in general the strong converse does not hold for the compound testing problem. In other words, depends on .
IV Testing Against Generalized Independence
In this section we consider the hypothesis testing problem involving mixture distributions. We use results and techniques from Section III to establish the optimal achievable error exponent in this section. We begin with our model’s definition.
Assume that we have two sets of distributions and , which fulfill the conditions given in Assumption 1.
For given , let the distribution under the null hypothesis be defined as
(32) |
Similarly the distribution under the alternative hypothesis is given by
(33) |
where for all , , and . For notation simplicity in the subsequent analysis we define . We name this problem testing against generalized independence.
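A defining feature of the null model is that the latent component index is drawn once per block, after which the n pairs are generated iid from the selected component; the resulting samples are exchangeable but not iid. The following sampling sketch uses toy components and weights of our own choosing.

```python
import numpy as np

rng = np.random.default_rng(1)

# toy 2x2 joint pmfs mixed with weights w (illustrative values only)
components = [np.array([[0.40, 0.10], [0.10, 0.40]]),
              np.array([[0.25, 0.25], [0.25, 0.25]])]
w = np.array([0.6, 0.4])

def sample_mixture_of_iid(n):
    """Draw (x^n, y^n) from sum_i w_i P_i^n: pick the component index ONCE,
    then generate n iid pairs from that component."""
    i = rng.choice(len(components), p=w)
    sym = rng.choice(4, size=n, p=components[i].ravel())
    return sym // 2, sym % 2   # decode the flat symbol into (x, y)

x, y = sample_mixture_of_iid(10)
print(x, y)
```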
The model of subsumes the following two cases:
-
•
testing against independence in which hold,
-
•
and testing against (unobserved) conditional independence in which hold.
For a given pair of compression-decision mappings , we define the corresponding type-I and type-II (false alarm and miss detection) probabilities as
(34) |
Similarly as in Definition 1 we say that is an ε-achievable error exponent at a compression rate for testing against if there exists a sequence of testing schemes such that all the conditions in (6) are satisfied. We denote the maximum ε-achievable error exponent at the given rate by . We first characterize the optimal achievable error exponent in the testing against generalized independence problem .
Theorem 3.
For a given compression rate , in testing against using one-sided compression, we have
(35) |
We first provide a remark about the first equality in the statement of Theorem 3. In it we highlight the difference between our model and a previous study.
Let us consider the case that for all pairs holds. Assume that is achievable in the mixture setting via a sequence of testing schemes . Since and for all , we have . Similarly, for an arbitrarily given and for all sufficiently large we have . Since for all , holds, we have . This implies is an achievable error exponent in the compound hypothesis testing problem with the corresponding sequence of testing schemes . Hence holds. The arguments discussed herein are similar to the ones given in [12] when data are not compressed. In our proof of Theorem 3, we only need the restriction that for all and .
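In generic notation, the reduction just sketched rests on the elementary component-wise bounds below: if the mixture errors of a testing scheme are α_n and β_n, then the errors under the individual components are at most constant multiples of them, so error exponents carry over unchanged.

```latex
% With null mixture \sum_i w_i P_i^n and alternative mixture \sum_j v_j Q_j^n:
\alpha_n^{(i)} \le \frac{\alpha_n}{w_i}, \qquad \beta_n^{(j)} \le \frac{\beta_n}{v_j},
% hence \liminf_{n\to\infty} -\tfrac{1}{n}\log \beta_n^{(j)}
%        \ge \liminf_{n\to\infty} -\tfrac{1}{n}\log \beta_n for every j.
```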
We explain the idea of showing in the following. For simplicity assume that there is no mismatch, i.e., as well as hold. Assume that is an achievable error exponent via a sequence of testing schemes . A central idea of the Neyman-Pearson framework is to consider a decision region based on the likelihood ratio. An advantage of working with a likelihood-based decision region is that elementary set operations such as intersection, contraction, etc. can be performed through simple change of measure steps either in the numerator or denominator of the likelihood ratio. We want to show that if is achievable then roughly
(36) |
The term inside the bracket is a rejection region based on the likelihood ratio for testing against . Then based on the definition of the spectral-inf mutual information rate as well as the fact that the spectral-inf mutual information rate is bounded by the inf-mutual information rate, we can arrive at a conclusion that for an arbitrarily given , and for all ,
(37) |
for all sufficiently large . Then we can use the standard single-letterization method to obtain that . In order to obtain the conclusion in (36), we need to perform several change of measure steps. First we form a decision region based on the likelihood ratio of and as well as the achievable error exponent . Then it can be shown that . We then do the first change of measure step from to to obtain
(38) |
This can be seen from the definition of , as . Next we need to change the measure inside the definition of . Roughly we want to show that the following inequality holds
(39) |
Changing the measures in the denominator from to and to can be done based on inequalities and . These inequalities follow from the definition of and hold for all and . The change from to is not based on the definition of . However we can show that it holds with high probability. The proof of the general case involves another change of measure step from to using our code transformation arguments.
Proof.
Assume that is an achievable error exponent at a compression rate in the compound hypothesis testing problem via a sequence of testing schemes . We have by definition
(40) |
Applying this sequence to the current testing against generalized independence setting we obtain
(41) |
We obtain that
(42) |
Hence is also achievable in our testing against generalized independence setting, cf. Definition 1. Therefore we have
(43) |
Now for an arbitrarily given , assume that is a sequence of testing schemes such that
(44) |
hold. Define for each the following decision region based on the likelihood ratio
(45) |
By [11, Lemma 4.1.2] and the definition of , we have for all
(46) |
Let be a sequence such that and as . For each , we define a set
(47) |
We then have
(48) |
contains high probability pairs when we perform the change of measure from to in the numerator of the likelihood ratio test in the definition of . To make the derivation more compact, we further define the following two sets
(49) |
is a rejection region in testing against . is a rejection region of testing against , which is our first desired test. From the definition of , we know that for all pairs , and for all and , the following inequality holds
(50) |
This implies that holds. Furthermore for we have
(51) |
Using the above analysis, we perform the following two change of measure steps
(52) |
where
-
•
in we change the measures from to ,
-
•
in we change the measures from to .
In summary we have
(53) |
In combination with (44), since as holds, we obtain
(54) |
In the next step, for each , and , we will perform a change of measure from to . Here is a compression mapping for each class , constructed from .
For a given , consider the problem of differentiating between and , where , via the testing scheme . The corresponding error probabilities are given by
(55) |
Note that by the definition of we have
(56) |
We want to transform the given testing scheme to obtain a new testing scheme , , for differentiating between and . This can be done using similar arguments to those given in Appendix A as follows. For the given positive we define for the given a typical subset of
(57) |
Similarly for the given we define a typical subset of
(58) |
Then the new compression mapping is defined as
The decision mapping is defined as
Using this testing scheme we can bound the error probabilities in testing against as
(59) |
where . Then using [11, Lemma 4.1.2] we obtain the following inequality
(60) |
holds where
(61) |
is our desired decision region using the likelihood ratio test in testing against . Using the inequalities in (59) we obtain
(62) |
Under Assumption 1, and as due to either the weak law of large numbers or Theorem 1 in [24]. Since both , and as hold, by combining (54) and (62) we have for all
(63) |
where holds. Hence, for all , , we have
(64) |
where and is the spectral-inf mutual information rate, defined for a joint process as
(65) |
Since the spectral-inf mutual information rate is less than or equal to the inf-mutual information rate by [11, Theorem 3.5.2]
holds, , we have
(66) |
For each , let be such that
(67) |
Then for all , we have
(68) |
Let be a uniform random variable on and independent of everything else. For each and , we define for all and . Therefore for all and , as well as for all sufficiently large , say , we have
(69) |
where follows since , and is valid since forms a Markov chain. Similarly we also have
(70) |
Note that for , we have . We define this common kernel for each as . Therefore for all and all sufficiently large we have
(71) |
where for all we have . By standard cardinality bound arguments [27] and taking we have that for all
(72) |
Hence holds for all , which leads to . ∎
V -Error Exponent in Mixture Setting
In this section we characterize the maximum ε-achievable error exponent in the testing against generalized independence setting in Section IV.
V-A Small ε-optimality of
The following partial result is an immediate consequence of the first part of Theorem 2. It states that when ε is small enough, the maximum achievable error exponent is also ε-optimal.
Corollary 1.
For a given , if , where the inactive set is defined as in the statement of Theorem 2, then
(73) |
Proof.
Assume that is a sequence of testing schemes such that the conditions in Definition 1 are satisfied for the pair
(74) |
For an arbitrarily small and for all , , we have for all . Furthermore we also have
(75) |
which implies that we have . By the proof of the first part of Theorem 2 we then obtain . Since is arbitrary, the conclusion follows. ∎
V-B A sufficient condition for characterizing
To obtain a full characterization of we need to make an additional assumption. By [28, Theorem 2.5] the maximization in the expression of , , can be restricted to the set . For each , we define to be the set of optimal solutions of within . We define a set as follows:
(76) |
For an arbitrarily given , the set of can be arranged as
(77) |
This corresponds to a permutation on . It can be seen that when is not empty then the permutation does not depend on a specific in . To account for more general scenarios, we make the following separability assumption.
Assumption 2.
For a given set of distributions and a given set of sequences of distributions , a class , and , any two kernels and in satisfy , i.e., the order is invariant to the change of kernels inside .
It can be seen that Assumption 2 is satisfied when for all , i.e., when all have distinct marginal distributions. In the following we briefly discuss two non-trivial scenarios in which Assumption 2 can be fulfilled.
Example 1: In this example we assume that and hold. This implies that for all and for all , we have and . We then assume further that within each class the set of channels can be ordered according to the less noisy relation (a channel is less noisy [29] than a channel if for every we have ). Then for all , we have
(78) |
for some fixed order which does not depend on whether is empty or not. Hence Assumption 2 is satisfied.
Example 2: We consider another example in which the set of distributions in the null hypothesis is . We assume further that is an erasure channel with erasure probability . We also assume that and hold in this example. Then , which can be achieved by any kernel such that . This implies that holds, a non-empty set, and hence Assumption 2 is satisfied.
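Example 2 rests on the identity I(U;Y) = (1−p) I(U;X), valid whenever P(Y|X) is an erasure channel with erasure probability p and U–X–Y forms a Markov chain. A quick numerical check (the joint distribution of (U,X) below is random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def mi(p):
    px = p.sum(1, keepdims=True)
    py = p.sum(0, keepdims=True)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / (px * py)[m])))

p_erase = 0.3
p_ux = rng.random((3, 4)); p_ux /= p_ux.sum()   # random joint pmf of (U, X)

# pass X through an erasure channel: Y agrees with X w.p. 1-p,
# and equals the erasure symbol (index 4) w.p. p, independently of U
p_uy = np.zeros((3, 5))
p_uy[:, :4] = (1 - p_erase) * p_ux
p_uy[:, 4] = p_erase * p_ux.sum(axis=1)

print(mi(p_uy), (1 - p_erase) * mi(p_ux))   # the two values coincide
```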
Assumption 2 implies that at a given we have
(79) |
This can be seen as follows. If , we can take any kernel inside to achieve for all . Hence (77) leads to (79). Otherwise, we take a kernel to obtain since holds and so on. Furthermore we observe that for all , we have
(80) |
can be verified as follows. If , we can take any kernel inside to achieve . Otherwise, holds, then follows. Therefore, we have the following relation
(81) |
Assume that at a given , Assumption 2 is valid. By relabeling elements in the set if necessary, we assume that is an increasing sequence. An example of such ordering is given in Fig. 2.

For notation simplicity we define for each a left-over subset of as . Since the ordering is unique, (79) implies that . Therefore, when , we have
(82) |
The above analysis leads to the following result.
Proposition 1.
Assume that at a given Assumption 2 is fulfilled and is an increasing sequence. Then for each the following holds
(83) |
The left-hand side in (83) is the maximum achievable error exponent in testing against . This result will be used to establish the optimal ε-error exponent in this section.
Proof.
V-C Characterization of under Assumption 2
A complete characterization of under Assumption 2 is provided in the following.
Theorem 4.
Assume that at a given , Assumption 2 holds and is an increasing sequence. Then we have
(85) |
The behavior of is depicted in Fig. 2. The proof of Theorem 4 uses the result from the second part of Theorem 2. It demonstrates an application of establishing an exponentially strong converse. We briefly describe the proof idea in the following.
Fix an such that holds. We build an ε-achievable sequence of testing schemes based on a compound hypothesis testing problem in which we need to specify two sets of distributions in the null and alternative hypotheses. The set of distributions in the alternative hypothesis is
(86) |
Since we are interested in showing that ε-error is achievable, we select the set of distributions in the null hypothesis as
(87) |
We omit the other distributions for in the null hypothesis of the above problem because in the mixture setting they contribute only to a total error probability of up to , which is what we desire. Given a sequence of testing schemes such that is achievable in testing against , we obtain as by-products the collection of intersected decision regions with vanishing complement probability, cf. Theorem 1.
We use to compress . Next we select a decision region for testing against such that for all , holds. This ensures that for all . Hence, asymptotically we have .
As in the converse proof of Theorem 3, assume that we have a likelihood based decision region for testing against . Then we use change of measure steps to obtain for each a likelihood based rejection region of testing against , called . Then for all sufficiently large we have
(88) |
We show that for all , converges to 1 when holds. Taking the limit superior on both sides of the above inequality, we obtain that , a contradiction. Hence we must have .
When no compression is involved, i.e., is the identity mapping, the convergence of to 1 can be seen from the weak law of large numbers since is characterized by the likelihood ratio. In our case this is not possible since the likelihood ratio cannot be factorized as the sum of identical components. Using the strong converse arguments which guarantee the convergence in the limit superior sense, it can be seen that if holds, then we must have . This approach does not match the achievability result and does not yield conclusive information when . The required convergence of is guaranteed by the second part of Theorem 2, since holds for all due to Assumption 2.
Proof.
Achievability: Given , consider a reduced compound problem of designing a testing scheme to differentiate between and . By Theorem 1 and Proposition 1, for each , there exists a sequence of testing schemes for this reduced compound setting such that is achievable, i.e.,
(89) |
In contrast to the achievability proof of Theorem 3, although we can use the same sequence of compression mappings to compress , we need to define a new sequence of decision mappings . For each and given the compression mapping , define the following measure
(90) |
Define an acceptance region for our testing against problem as follows
(91) |
Then the probability of miss detection is given by
(92) |
For the given , the false alarm probability is given by
(93) |
For each , if then we have
(94) |
This means that , cf. (17) for the definition. In summary we have
(95) |
Therefore we have
(96) |
This implies that if .
Converse: Given an ε and an arbitrary , assume that is ε-achievable via a sequence of testing schemes . In a similar fashion as in the converse proof of Theorem 3 we have,
(97) |
where the rejection regions are defined as in (49). Assume that holds. This implies that for all we have . For each , we apply the second part of Theorem 2 to the problem of testing against where via the sequence of testing schemes to obtain that
(98) |
Since as , this implies that
(99) |
a contradiction. Therefore we must have and hence . ∎
VI A refined relation to the WAK problem
In this section we use the techniques and results from the previous sections to establish new results for the WAK problem in which the joint distribution is a mixture of iid components.
We first recall the definition of the WAK problem. Assume that we have a joint source which takes values on an alphabet where is finite or countably infinite. A code for the WAK problem is a triple of mappings
(100) |
In the WAK setting we aim to control the error probability where . The achievable region has been characterized using the information-spectrum formula in [30]. We are interested in single-letter formulas for the (ε-)achievable regions. To obtain these we relate the WAK problem to the testing against independence problem with general distributions.
The hypotheses are given by
(101) |
where is a distribution on . Similarly we use (3) and (4) for the definition of a testing scheme in this case. Additionally, Definition 1 is taken as the definition of -achievability.
When is a finite or countably infinite alphabet, and the process in the WAK problem is a stationary and ergodic process which has a finite entropy rate, as well as and , a generalized relation between the WAK problem and the hypothesis testing against independence problem has been established in our recent work [21]. The relation allows us to transfer results from the testing against independence problem to the WAK problem and vice versa. However it is not strong enough for our current interest. In the following we study a refined relation between the two problems.
In this section we similarly assume that where specifically is a finite alphabet. We further assume that where is the uniform distribution on . For simplicity we call this setting U(niform)-HT. Define a set
(102) |
For a given WAK-code , and an arbitrary number , we have
(103) |
where holds as for a given the number of satisfying is upper bounded by . Therefore, we obtain
(104) |
Additionally, given a testing scheme for the U-HT problem and an arbitrary positive number we also have by [11, Lemma 4.1.2]
(105) |
Similarly as in [21, Theorem 2], we have the following result.
Theorem 5.
Given positive numbers and .
-
•
From a WAK-code , we can construct a testing scheme for a U-HT problem such that
(106)
-
•
For a given U-HT testing scheme , there exists a WAK-code such that
(107)
Proof.
WAK → U-HT: given a WAK-code we design a U-HT testing scheme as follows. We use to compress in the U-HT setting. A decision region for the U-HT setting is given by . Then the false alarm probability is upper bounded as
(108) |
The miss detection probability is upper bounded as
(109) |
where follows since for we have .
U-HT → WAK: given a U-HT scheme we show the existence of a WAK-code as follows. We use to compress in the WAK problem. We randomly assign to an index in an alphabet . For each we denote the corresponding (random) set of such by . We declare that is the original source sequence if it is the unique sequence satisfying and . For each , the cardinality of the set of satisfying is upper bounded by . There are two sources of errors:
-
•
either holds,
-
•
or there exists another for which and hold.
The probability of the first event is upper bounded as
(110) |
The probability of the second event is upper bounded by because each sequence is assigned to a bin with probability and the number of such sequences satisfying the second event is upper bounded by . Hence, it can be seen that
(111) |
∎
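The dominant error event in the binning construction above is a collision: some other sequence in the same bin also passes the acceptance test. Since each of roughly exp(n H(X|Y)) confusable sequences lands in the true bin independently with probability 1/M, the union bound gives an error term of about exp(n H(X|Y))/M. The sketch below simulates only this collision event; the value of the conditional entropy and the rates are toy choices of ours.

```python
import numpy as np

rng = np.random.default_rng(3)
H_cond = 0.5   # toy value (nats) standing in for the conditional entropy H(X|Y)

for n in (20, 40, 60):
    for R in (0.45, 0.60):   # binning rate below vs. above H(X|Y)
        K = int(np.exp(n * H_cond))   # ~ number of confusable sequences
        M = int(np.exp(n * R))        # number of bins
        # each confusable sequence independently hits the true bin w.p. 1/M
        err = np.mean(rng.binomial(K, 1.0 / M, size=2000) > 0)
        print(f"n={n}, R={R:.2f}: P(collision) ~ {err:.4f}, "
              f"union bound K/M = {min(1.0, K / M):.4f}")
```

For R above the conditional entropy the collision probability decays exponentially in n, which is the mechanism behind the second part of the proof.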
Fix an ε. Let be the closure of all such that there exists a sequence of WAK-codes which satisfies
(112) |
Define . We observe that for all and . By setting in (105) we obtain that . The following result summarizes the relation between the minimum encoding rate in the WAK problem and the maximum ε-achievable error exponent in the U-HT problem. As in [21, Theorem 3] we have the following result.
Corollary 2.
For any given and , we have
(113) |
Proof.
We consider the extreme case where holds. Assume that is an achievable error exponent in the U-HT problem with the corresponding sequence of testing schemes . We take where . By plugging it into the second part of Theorem 5 and choosing , we obtain a sequence of WAK-codes such that . The corresponding compression rate is , a contradiction. Therefore in this case we must have . The other cases can be worked out similarly as in the proof of [21, Theorem 3]. ∎
Assume now that the source distribution in the WAK problem (and the distribution in the U-HT problem) is given by . Further we assume that for all , and as well as hold. Therefore we can use results from Section III as follows. The quantities , , in this case are given by
(114) |
Therefore using Corollary 2 we obtain
(115) |
Similarly, if Assumption 2 holds and is an increasing sequence, the minimum ε-achievable compression rate at is given by
(116) |
where in this case for all we have
(117) |
Appendix A Hypothesis testing with two-sided compression
In this section we study code transformations between hypothesis testing problems with compression at both terminals. Although the setting considered in the following is not directly related to our main problem, the arguments presented herein are useful in simplifying the proofs of the main results in this work.
Assume that is available at Terminal 1. Further is available at Terminal 2. At a given number of samples a generic testing scheme involves a triple of mappings where
(118) |
Similarly as in (5) the generic error probabilities and are defined as
(119) |
We study a relation between two hypothesis testing problems which have the same distribution in the null hypothesis. The first problem involves the following hypotheses
(120) |
whereas the second problem considers
(121) |
We make the following technical assumptions
(122) |

An overview of our derivation is depicted in Fig. 3. Let be a testing scheme for differentiating between and in (120). We construct a testing scheme for testing against in (121) as follows. For an arbitrarily given , we define typical sets , and as
(123) |
where and are finite numbers.
We define the compression mapping as follows
(124) |
Similarly the compression mapping is defined as
(125) |
The corresponding decision mapping , is given as
For notation simplicity we define for and , . The set of all for which and hold is which can be factorized further as
(126) |
In differentiating between and the induced false alarm probability by the testing scheme is upper bounded as
(127) |
Similarly the probability of miss detection is bounded by
(128) |
Given a testing scheme for testing against we apply a similar procedure as from (124) to (A), by swapping the positions of and for , in (124) and (125) as well as switching the roles of and in (A), to obtain a testing scheme for testing against .
The induced false alarm probability in testing against is similarly upper bounded by
(129) |
The induced miss detection probability in testing against is upper bounded as
(130) |
For a given pair of compression rates , let be the maximum ε-achievable error exponent for testing against . Similarly let be the maximum ε-achievable error exponent for testing against . To establish a relation between and , we make the following assumption.
Assumption 3.
The sequences of joint distributions , , , and the quantities , , satisfy (122) and the following conditions
(131) |
We give some examples in which Assumption 3 is satisfied.
-
•
Assume that , are stationary and ergodic processes as well as , , and , are finite order Markov processes with stationary transition probabilities such that the conditions in (122) are satisfied. Assume further that the relative divergence rates
(134)
are finite. With , and , the conditions in (131) are fulfilled by [24, Theorem 1].
Assume that Assumption 3 is fulfilled, then (127) and (128) imply that
(135) |
Conversely, (129) and (130) imply that
(136) |
In conclusion we have shown the following result.
Theorem 6.
Given and , under Assumption 3 we have
(137) |
As a corollary of this result, by setting and , we have the following result, which states that Theorem 5 in [1] is tight for a special case.
Theorem 7.
Assume that the sequences of distributions and satisfy Assumption 1 with , , , and . In testing against using one-sided compression of the sequence , the maximum ε-achievable error exponent is given by
(138) |
where in the first category , and in the second category .
Appendix B Proof of Theorem 1
B-A A support lemma
For the achievability proof of Theorem 1 we need the following support lemma.
Lemma 1.
Let be either or in Assumption 1. For each , let be the corresponding or . For each , define where for simplicity are fixed positive numbers and . Also for each , we assume that . Then for an arbitrarily given we have
(139) |
Proof.
It can be seen that
(140) |
If belongs to the second category, we always have for all . This leads to for all . The second inequality then follows since for all holds.
Assume now that lies in the first category. When holds, i.e., for all we have , then the set can be omitted since it has zero probability. If , for example when there exists a satisfying , then for all such that we have which in turn violates the inequality . Therefore in both cases we only need to consider the set , in which the second inequality follows from the definitions of , and as well as the fact that for all holds. For notation simplicity, we define . We examine two cases in Assumption 1 separately in the following.
-
•
Consider the first case in which is the set of product distributions on a common alphabet . To avoid dealing with cumbersome extended real number operations, we define a set . We further have (without considering , we might encounter an indeterminate form in the third expression)
(141)
If , then by the weak law of large numbers (141) goes to 0 as . When , let be the largest subset of such that and for all . By our assumption we have , and if then holds. Therefore we have
(142)
Therefore
(143)
∎
B-B Existence of a testing scheme
Let be the minimum (total variation) distance between any two probability distributions in the set of marginal distributions on , . Denote by the type of a sequence . We define a mapping as follows. We look for a unique such that . If such an exists, we set . Otherwise we set .
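The estimation step just described can be mocked up as follows: compute the type (empirical distribution) of the observed sequence and return the index of the unique candidate marginal within half the minimum total variation distance; by the triangle inequality at most one candidate can qualify. The threshold and the fallback index are our illustrative choices.

```python
import numpy as np

def estimate_index(x, marginals):
    """Index of the unique candidate marginal whose total variation distance
    to the type of x is below d_min/2; fall back to index 0 otherwise."""
    type_x = np.bincount(x, minlength=len(marginals[0])) / len(x)
    tv = np.array([0.5 * np.abs(type_x - p).sum() for p in marginals])
    d_min = min(0.5 * np.abs(p - q).sum()
                for i, p in enumerate(marginals) for q in marginals[i + 1:])
    hits = np.flatnonzero(tv < d_min / 2)   # at most one index can qualify
    return int(hits[0]) if len(hits) == 1 else 0

rng = np.random.default_rng(4)
marginals = [np.array([0.5, 0.5]), np.array([0.85, 0.15])]
x = rng.choice(2, size=200, p=marginals[1])
print(estimate_index(x, marginals))   # -> 1 with high probability
```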
For a given and for each , let be a probability kernel which achieves . Define
(145) |
Let be the corresponding marginal joint distribution on . For notation simplicity we define the following score function (by our assumption , if then there exists an such that ; therefore the score function is well-defined)
(146) |
For each , and , let be a tuple of random variables such that . Let be arbitrary but given. For each draw codewords , , from the marginal distribution of . Given , let be an estimate of the index of the marginal distribution of . Assume that . We proceed with the compression process if
(147) |
where is the marginal on of . Otherwise we send a special index .
When (147) is fulfilled, we then select a transmission index to be where for all
(148) |
If , we set . The pair is provided to the decision center. If , and hold we declare is the underlying distribution if
(149) |
Otherwise we declare that is true. Given a codebook realization, let and be induced random variables under the null hypothesis. For each , , the corresponding false alarm probability is given by
(150) |
For notation simplicity we define for each , . The first term can be upper bounded further as
(151) |
Let be another tuple of generic random variables that is independent of the random codebook such that . Furthermore for the given codebook realization let be the random message induced by through the encoding process. Then we have
(152) |
Averaging over all codebooks we obtain by Fubini’s theorem
(153) |
By the non-asymptotic covering lemma in [25, Lemma 5] we have
(154) |
This result implies the following chain of expressions
(155) |
For each let be a tuple of random variables such that and hold. From the definition of we also have
(156) |
In summary there exists a codebook realization, hence a mapping , such that
(157) |
holds.
B-C Bounding error probabilities
Due to Lemma 1, goes to 0 as . By the weak law of large numbers the terms and go to as . We focus now on the first term. We observe that
(158) |
Similarly, the last term goes to 0 due to Lemma 1. In the next step we perform several change of measure steps from the general distribution to our distributions of interest , . For that purpose, for each , we define the following sets
(159) |
where and as . We have
(160) |
For as holds by the definition of , we further have
(161) |
This leads to
(162) |
For an arbitrary select which implies that holds. By the weak law of large numbers we obtain
(163) |
Therefore we have
(164) |
Define the following sets
(165) |
and
(166) |
is our decision region described in (149). Using and we perform in the following change of measure steps from to and from to in the calculation of the miss detection probability. For each and we have
(167) |
where the second-to-last expression follows since implies that (otherwise in the previous line the first term is while the second term is ), and due to our coding arguments . Furthermore for , as holds, we also have
(168) |
Therefore, for each and the probability of miss detection is bounded by
(169) |
This implies that we have
(170) |
Therefore, the chosen sequence of satisfies
(171) |
Combining with the definition of intersected sets this further implies that for all we have
(172) |
Appendix C Proof of Theorem 2
For an arbitrarily given , define the following typical sets
(173) |
We have
(174) |
either due to the weak law of large numbers or due to [24, Theorem 1]. For simplicity we first consider the case that there is a single marginal distribution in the set . In this case we simply write as , as , as , and as . Furthermore, for notation compactness we define the following quantities in this case
(175) |
To support the analysis we define . Given an arbitrary joint distribution , consider the following region
(176) |
It can be seen that only depends on marginal distributions but not on the joint distribution . Without loss of generality we assume that in the evaluation of , . In the following we use the hyperplane characterization of . For that purpose, we first show the following result.
Lemma 2.
is a closed, convex set. Furthermore, is a concave function.
Proof.
Assume that are two points in with corresponding kernels . Assume that is a random variable taking values on with , , and independent of everything else. Define . Then we have
(177) |
Therefore, the convex combination also lies in , i.e., is a convex set. It can also be seen that is a closed set since all alphabets are finite. Let be the reflection of the set via the -axis. Then is a convex set. Furthermore the function
(178) |
is a convex function. This implies that is a concave function. ∎
For any define
(179) |
Since is a closed convex set, the line supports or . (Assume that and hold. Then there exists a pair such that for all , as is a closed convex set. Since , we must have . This implies that either or is negative by plugging in . Similarly, plugging with sufficiently large into , we see that is positive and is negative. Setting , we see that if then . Hence holds. The other direction is straightforward.) For our proof we also use an additional characterization of , which is stated in the following. For any pair of positive numbers we define
(180) |
where is a joint probability measure on satisfying and . By the support lemma in [27] we can upper bound the cardinality of by a constant. An alternative characterization of is given in the following.
Lemma 3.
(181) |
C-A Strong converse proof for
We now present the main part of the proof of Theorem 2. In the proof, we will use a recent technique by Tyagi and Watanabe [26]. When the inactive set is empty, , showing that for all can be deduced from the strong converse result of testing against for all , cf. Theorem 7. Without loss of generality, we assume in the following that the inactive set is not empty .
Assume first that the set of marginal distributions on , , is a singleton. This implies that holds. Given a sequence of testing schemes such that the ε-achievability conditions are fulfilled
(182) |
we define for each the following likelihood based decision region
(183) |
By [11, Lemma 4.1.2], cf. also [31, Lemma 12.2], we obtain
(184) |
For each , can be seen as a testing scheme for differentiating between and . For a given we now construct a testing scheme to differentiate between and . Given , the compression mapping does not depend on . Our arguments are similar to the code transformations given in Appendix A. We present the procedure in the following for completeness. Given , is defined as
For each , the decision mapping is defined as
In the above definitions the typical sets (or ) and are defined in (173). Then we also have
(185) |
Let be the corresponding acceptance region of . We take to be sufficiently large such that the following conditions hold for all
(186) |
Next, we can further decompose as
(187) |
For simplicity define . Consider the set
(188) |
where . Then we have
(189) |
which implies that . Using this inequality we further obtain
(190) |
For , we require that
(191) |
must hold. This in turn implies that . Taking , we obtain . Define , , and the following distribution on
(192) |
For a given we also define the following joint conditional distribution on
(193) |
Additionally, we define for all if . Let be a tuple of general sources such that
(194) |
Then for each we have the following inequality
(195) |

Consider an arbitrarily fixed sequence . For the following analysis, the illustration given in Fig. 4 is perhaps helpful for readers. Let be a message index such that and hold. For all , we then have
(196) |
If is another message index such that and that hold, then for all , we have . When , we also have . Hence for all , we obtain
(197) |
We observe that not only holds but also we have the following bound
(198) |
We now derive bounds on the compression rate and the ε-achievable error exponent using expressions involving . We first have
(199) |
where . Note further that the following Markov chain holds
(200) |
For each the support set of the joint distribution is a subset of the following set, cf. Fig. 4 for a visual illustration,
Then compared with (187) we have due to the restriction on . Therefore, on one hand we have
(201) |
On the other hand from (195) and (197) we have
(202) |
where holds because . The last equality is valid because for we have . This leads to the following
(203) |
where we have used the log-sum inequality in and the summation therein is taken over and . The last inequality holds since for we have . Combining (185), (201), and (203) we obtain
(204) |
By combining the two previous inequalities, (199) and (204), we obtain for arbitrarily given positive numbers and
(205) |
where
(206) |
Let be a uniform random variable on . Furthermore, define for all , and . We show at the end of this subsection that
(207) |
(208) |
In summary we obtain for all positive and that
(209) |
Taking and the supremum over , and using Lemma 3, we have shown that . Taking we obtain the conclusion that .
If the set of marginal distributions on , , has more than one element, then for each inactive we have provided that holds. For each active , by the strong converse result for the simple hypothesis testing problem we obtain for all . Thereby we obtain the conclusion.
Proof of Lemma 3: For any in the optimization domain of , select a . Then we can see that
Given an , let be an optimal solution for . As , and , we have
(210) |
We then have
(211) |
where as . Taking the supremum over we obtain the conclusion.
Proof of (207) and (208): First, we have
(212) |
Note that since , whenever we must have for all . This implies that . These absolute continuity relations ensure the validity of the following derivations
(213) |
and
(214) |
In the last step we have used the inequality , as in this case and might not be independent of each other. This implies (207). Furthermore, since conditioning reduces entropy, we can lower bound the term as follows
(215) |
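The step in which conditioning reduces entropy is the standard fact that for any random variables $A$, $B$, $C$,
\[
H(A \mid B, C) \;\le\; H(A \mid B), \qquad \text{equivalently} \qquad I(A; C \mid B) \;\ge\; 0,
\]
which is the nonnegativity of conditional mutual information.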
C-B The case that
Fix a compression rate and an arbitrary . If the inactive set is empty, , then we have and the threshold becomes . We can use the achievability of Theorem 1 and the strong converse in Subsection C-A to verify the statement. Therefore, in the following we assume that the inactive set is non-empty, . Our coding scheme is influenced by the one given in [7].
C-B1 Construction of a testing scheme
Recall that represents the distributions which have the same marginal distribution on , and is the number of these. For each inactive we partition the set into sets such that for all sufficiently large , we have for all . This is possible because we have . (To see this, let be an arbitrarily given number. Given a type class which is a subset of the strongly typical set , we divide it into subsets such that each subset has cardinality and omit the remaining sequences. Enumerating over all type classes inside the typical set, the number of omitted typical sequences is upper bounded by . Hence each of the constructed subsets has probability of at least for all sufficiently large . The atypical sequences and the omitted typical sequences can then be assigned to these sets randomly.) Furthermore, for each where , let be its position inside the set according to the natural ordering.
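A hedged sketch of the counting argument just given, written with generic placeholders ($P$ for the relevant marginal distribution, $k$ for the number of subsets, and $\epsilon'$ for a typicality slack vanishing with the typicality parameter): by the method of types [27] there are at most $(n+1)^{|\mathcal{X}|}$ type classes, fewer than $k$ sequences are omitted from each, and every strongly typical sequence has probability at most $2^{-n(H(P)-\epsilon')}$, so the total omitted probability mass satisfies
\[
\Pr[\text{omitted}] \;\le\; (k-1)\,(n+1)^{|\mathcal{X}|}\, 2^{-n(H(P)-\epsilon')} \;\longrightarrow\; 0 \quad \text{as } n \to \infty,
\]
which is why each constructed subset retains probability close to $1/k$ for all sufficiently large $n$.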
For each of these inactive states and , let be a sequence of testing schemes to differentiate between and such that is achievable. Then, similarly to the conclusion of Theorem 1, the false alarm probability of the likelihood ratio test also goes to zero
(216) |
Note also that the following expressions
(217) |
hold due to either the weak law of large numbers or Theorem 1 in [24]. Rewriting the above expression we therefore obtain
(218) |
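The convergences in (217) are instances of the following general facts; the placeholder distributions $P$ and $Q$ stand for the respective component distributions. For an iid sequence $X^n \sim P^n$, the weak law of large numbers gives
\[
\frac{1}{n} \log \frac{P^n(X^n)}{Q^n(X^n)} \;=\; \frac{1}{n}\sum_{i=1}^{n} \log \frac{P(X_i)}{Q(X_i)} \;\xrightarrow{\;\Pr\;}\; D(P \,\|\, Q),
\]
while for the Markov components the corresponding almost-sure convergence of the normalized log-likelihood is supplied by the generalized Shannon-McMillan-Breiman theorem, Theorem 1 in [24].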
For each active state , , let be a sequence of testing schemes to differentiate between and such that is achievable. The existence of follows from Theorem 1. Similarly from the conclusion of Theorem 1, for each intersected set , , we have
(219) |
From the definition of in (17) we further have
(220) |
Using (217) again we obtain
(221) |
Define the following auxiliary distribution
(222) |
Let and be the marginal distributions of . Furthermore, let be the push-forward distribution resulting from applying and to , i.e.,
(223) |
We use the same mapping to estimate before encoding the , as in the proof of Theorem 1. If , we send to the decision center. If , we check whether , where
(224) |
If the condition is not fulfilled, we send a special symbol . Assume now that the condition holds. If , we send the following message to the decision center: . If , we send . For , we decide that the null hypothesis is true if , and
(225) |
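A hedged schematic of the encoder just described may be useful; the names $f_n$ (overall compression mapping), $\hat{s}$ (estimated mixture component), $\varphi_n^{(\hat{s})}$ (component encoder), and $j$ (the position index within a partition cell) are illustrative placeholders rather than the paper's exact notation, and the case split reflects one plausible reading of the construction:
\[
f_n(x^n) \;=\; \begin{cases} 0 & \text{if the typicality check fails},\\ \big(\varphi_n^{(\hat{s})}(x^n),\, \hat{s}\big) & \text{if the estimated component } \hat{s} \text{ is active},\\ \big(j,\, \hat{s}\big) & \text{if the estimated component } \hat{s} \text{ is inactive}. \end{cases}
\]
The decision center then declares the null hypothesis only when the received message is consistent with the corresponding component test, as specified in (225).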
C-B2 Bounding error probabilities
For each where is active, , the false alarm probability is given by
(226) |
Similarly, for each where is not active, , we have
(227) |
for all sufficiently large . The last inequality holds since for all sufficiently large we have by the definition of . Let be a sequence such that and as . Using the change-of-measure steps as in the proof of Theorem 1 we have
(228) |
C-C Convergence of
We assume that the optimality-achieving index is active, ; otherwise there is nothing to prove. Let , and be defined as in (176), (179) and (180) with and in place of and in place of .
Let be an arbitrary sequence of testing schemes such that where for an arbitrary holds. Select small enough such that .
As in the previous proofs, we transform into a testing scheme , where is a deterministic mapping, for differentiating between and . This can be done by using the typical sets and defined in (173). The resulting error probabilities are similarly bounded by
(232) |
Let be the acceptance region of . We argue that there exists a such that for all we have
(233) |
Assume, to the contrary, that for all there exists an such that
(234) |
Similarly, for notational simplicity, we define as well as
(235) |
We then have
(236) |
For it can be seen that
holds whereas for we have . Therefore for all the following inequalities hold
(237) |
This further implies that we also have
(238) |
Hence we obtain
(239) |
where . By using the same lines of arguments as from (205) to (209) we have for given positive and
(240) |
As is a closed convex set and holds, by the hyperplane separation theorem there exist positive numbers and such that holds. Furthermore, there also exists an such that . Then we obtain for such
(241) |
This inequality is violated for . Then (233) and (232) imply that
(242) |
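For reference, the separation step above relies on the following standard fact: if $\mathcal{C} \subseteq \mathbb{R}^2$ is closed and convex and $p = (p_1, p_2) \notin \mathcal{C}$, then there exist $(\lambda_1, \lambda_2) \neq (0,0)$ and $c \in \mathbb{R}$ such that
\[
\lambda_1 p_1 + \lambda_2 p_2 \;<\; c \;\le\; \lambda_1 q_1 + \lambda_2 q_2 \qquad \text{for all } (q_1, q_2) \in \mathcal{C}.
\]
The positivity of the separating coefficients asserted above is an additional property that we read as coming from the monotonicity of the region in question, not from the separation theorem itself.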
References
- [1] R. Ahlswede and I. Csiszár, “Hypothesis testing with communication constraints,” IEEE Transactions on Information Theory, vol. 32, no. 4, pp. 533–542, 1986.
- [2] T. Berger, “Decentralized estimation and decision theory,” in IEEE 7th Spring Workshop on Inf. Theory, Mt. Kisco, NY, September 1979.
- [3] T. S. Han, “Hypothesis testing with multiterminal data compression,” IEEE Transactions on Information Theory, vol. 33, no. 6, pp. 759–772, 1987.
- [4] H. Shimokawa, T. S. Han, and S. Amari, “Error bound of hypothesis testing with data compression,” in Proceedings of 1994 IEEE International Symposium on Information Theory. IEEE, 1994, p. 114.
- [5] H. M. Shalaby and A. Papamarcou, “Multiterminal detection with zero-rate data compression,” IEEE Transactions on Information Theory, vol. 38, no. 2, pp. 254–267, 1992.
- [6] M. S. Rahman and A. B. Wagner, “On the optimality of binning for distributed hypothesis testing,” IEEE Transactions on Information Theory, vol. 58, no. 10, pp. 6282–6303, 2012.
- [7] C. Tian and J. Chen, “Successive refinement for hypothesis testing and lossless one-helper problem,” IEEE Transactions on Information Theory, vol. 54, no. 10, pp. 4666–4681, 2008.
- [8] S. Watanabe, “Neyman–Pearson test for zero-rate multiterminal hypothesis testing,” IEEE Transactions on Information Theory, vol. 64, no. 7, pp. 4923–4939, 2017.
- [9] G. J. McLachlan and D. Peel, Finite mixture models. New York: John Wiley & Sons, 2000.
- [10] P.-N. Chen, “General formulas for the Neyman-Pearson type-II error exponent subject to fixed and exponential type-I error bounds,” IEEE Transactions on Information Theory, vol. 42, no. 1, pp. 316–323, 1996.
- [11] T. S. Han, Information-Spectrum Methods in Information Theory. Berlin, Heidelberg: Springer-Verlag, 2003.
- [12] T. S. Han and R. Nomura, “First- and second-order hypothesis testing for mixed memoryless sources,” Entropy, vol. 20, no. 3, p. 174, 2018.
- [13] A. Ritchie, R. A. Vandermeulen, and C. Scott, “Consistent estimation of identifiable nonparametric mixture models from grouped observations,” Advances in Neural Information Processing Systems, vol. 33, pp. 11676–11686, 2020.
- [14] R. T. Elmore, T. P. Hettmansperger, and H. Thomas, “Estimating component cumulative distribution functions in finite mixture models,” Communications in Statistics-Theory and Methods, vol. 33, no. 9, pp. 2075–2086, 2004.
- [15] I. Cruz-Medina, T. Hettmansperger, and H. Thomas, “Semiparametric mixture models and repeated measures: the multinomial cut point model,” Journal of the Royal Statistical Society: Series C (Applied Statistics), vol. 53, no. 3, pp. 463–474, 2004.
- [16] Y. Wei and X. Nguyen, “Convergence of de Finetti’s mixing measure in latent structure models for observed exchangeable sequences,” arXiv preprint arXiv:2004.05542, 2020.
- [17] C. Pal, B. Frey, and T. Kristjansson, “Noise robust speech recognition using Gaussian basis functions for non-linear likelihood function approximation,” in 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1. IEEE, 2002, pp. I–405.
- [18] A. Anandkumar, D. Hsu, and S. M. Kakade, “A method of moments for mixture models and hidden Markov models,” in Conference on Learning Theory. JMLR Workshop and Conference Proceedings, 2012, pp. 33–1.
- [19] M. I. Jordan, “Stat260: Bayesian Modeling and Inference, Lecture 1: History and De Finetti’s Theorem,” 2010.
- [20] W. Kirsch, “An elementary proof of de Finetti’s theorem,” Statistics & Probability Letters, vol. 151, pp. 84–88, 2019.
- [21] M. T. Vu, T. J. Oechtering, and M. Skoglund, “Hypothesis testing and identification systems,” IEEE Transactions on Information Theory, vol. 67, no. 6, pp. 3765–3780, 2021.
- [22] A. Wyner, “On source coding with side information at the decoder,” IEEE Transactions on Information Theory, vol. 21, no. 3, pp. 294–300, 1975.
- [23] R. Ahlswede and J. Körner, “Source coding with side information and a converse for degraded broadcast channels,” IEEE Transactions on Information Theory, vol. 21, no. 6, pp. 629–637, 1975.
- [24] A. R. Barron, “The Strong Ergodic Theorem for Densities: Generalized Shannon-McMillan-Breiman Theorem,” Annals of Probability, vol. 13, no. 4, pp. 1292–1303, 1985.
- [25] S. Verdú, “Non-asymptotic achievability bounds in multiuser information theory,” in 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2012, pp. 1–8.
- [26] H. Tyagi and S. Watanabe, “Strong converse using change of measure arguments,” IEEE Transactions on Information Theory, vol. 66, no. 2, pp. 689–703, 2019.
- [27] I. Csiszár and J. Körner, Information theory: coding theorems for discrete memoryless systems. Cambridge: Cambridge University Press, 2011.
- [28] H. Witsenhausen and A. Wyner, “A conditional entropy bound for a pair of discrete random variables,” IEEE Transactions on Information Theory, vol. 21, no. 5, pp. 493–501, 1975.
- [29] J. Körner and K. Marton, “Comparison of two noisy channels,” in Topics in Information Theory, vol. 16. Amsterdam: North Holland, 1977, pp. 411–424.
- [30] S. Miyake and F. Kanaya, “Coding theorems on correlated general sources,” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. 78, no. 9, pp. 1063–1070, 1995.
- [31] Y. Polyanskiy and Y. Wu, “Lecture notes on information theory,” MIT (6.441), UIUC (ECE 563), 2017.