Bottleneck Problems:
Information and Estimation-Theoretic View†
†This work was supported in part by NSF under grants CIF 1922971, 1815361, 1742836, 1900750, and CIF CAREER 1845852.
Abstract
Information bottleneck (IB) and privacy funnel (PF) are two closely related optimization problems which have found applications in machine learning, design of privacy algorithms, capacity problems (e.g., Mrs. Gerber’s Lemma), strong data processing inequalities, among others. In this work, we first investigate the functional properties of IB and PF through a unified theoretical framework. We then connect them to three information-theoretic coding problems, namely hypothesis testing against independence, noisy source coding and dependence dilution. Leveraging these connections, we prove a new cardinality bound for the auxiliary variable in IB, making its computation more tractable for discrete random variables.
In the second part, we introduce a general family of optimization problems, termed "bottleneck problems", by replacing mutual information in IB and PF with other notions of mutual information, namely $f$-information and Arimoto's mutual information. We then argue that, unlike IB and PF, these problems lead to easily interpretable guarantees in a variety of inference tasks with statistical constraints on accuracy and privacy. Although the underlying optimization problems are non-convex, we develop a technique to evaluate bottleneck problems in closed form by equivalently expressing them in terms of lower convex or upper concave envelopes of certain functions. By applying this technique to the binary case, we derive closed-form expressions for several bottleneck problems.
I Introduction
Optimization formulations that involve information-theoretic quantities (e.g., mutual information) have been instrumental in a variety of learning problems found in machine learning. A notable example is the information bottleneck (IB) method [tishby2000information]. Suppose $Y$ is a target variable and $X$ is an observable correlated variable with joint distribution $P_{XY}$. The goal of IB is to learn a "compact" summary (aka bottleneck) $T$ of $X$ that is maximally "informative" for inferring $Y$. The bottleneck variable $T$ is assumed to be generated by applying a random function $f$ to $X$, i.e., $T = f(X)$, in such a way that it is conditionally independent of $Y$ given $X$, which we denote by
\[ Y - X - T. \tag{1} \]
The IB quantifies this goal by measuring the "compactness" of $T$ using the mutual information $I(X;T)$ and, similarly, its "informativeness" by $I(Y;T)$. For a given level of compactness $R$, IB extracts the bottleneck variable $T$ that solves the constrained optimization problem
\[ \mathsf{IB}(R) := \sup I(Y;T), \tag{2} \]
where the supremum is taken over all randomized functions $f$, with $T = f(X)$, satisfying $I(X;T) \le R$.
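Both the constraint $I(X;T)$ and the objective $I(Y;T)$ in (2) are ordinary mutual informations of joint pmfs induced by a candidate channel $P_{T|X}$. As a minimal numerical illustration (our own sketch, not part of the original formulation), the following computes mutual information from a joint pmf:

```python
import numpy as np

def mutual_information(p_joint):
    """I(X;Y) in bits for a joint pmf given as a 2-D array."""
    p = np.asarray(p_joint, dtype=float)
    px = p.sum(axis=1, keepdims=True)   # marginal of X (column vector)
    py = p.sum(axis=0, keepdims=True)   # marginal of Y (row vector)
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / (px @ py)[mask])).sum())

# A candidate bottleneck channel P_{T|X} induces joint pmfs P_{XT} and,
# through the Markov chain Y - X - T, P_{YT}; the constraint I(X;T) and
# the objective I(Y;T) in (2) are then plain mutual informations.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
print(mutual_information(p_xy))
```

The same helper evaluates both coordinates of the trade-off for any candidate channel, which is all that a brute-force search over small alphabets requires.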
The optimization problem that underlies the information bottleneck has been studied in the information theory literature as early as the 1970s (see [Gerber, Witsenhausen_Wyner, Gerbator_Ahlswede1977, Wyner_Gerber]) as a technique to prove impossibility results in information theory and also to study the common information between $X$ and $Y$. Wyner and Ziv [Gerber] explicitly determined the value of $\mathsf{IB}(R)$ for the special case of binary $X$ and $Y$, a result widely known as Mrs. Gerber's Lemma [Gerber, networkinfotheory]. More than twenty years later, the information bottleneck function was studied by Tishby et al. [tishby2000information] and re-formulated in a data-analytic context. Here, the random variable $X$ represents a high-dimensional observation with a corresponding low-dimensional feature $Y$. IB aims at specifying a compressed description of the observation $X$ that is maximally informative about the feature $Y$. This framework has led to several applications in clustering [slonim2000document, IB_clustering, Agglomerative_IB] and quantization [Quantization_IB, Quantization2_IB].
A closely related framework to IB is the privacy funnel (PF) problem [Makhdoumi2014FromTI, Calmon_fundamental-Limit, Asoodeh_Allerton]. In the PF framework, a bottleneck variable $T$ is sought that maximally preserves the "information" contained in $X$ while revealing as little about $Y$ as possible. This framework aims to capture the inherent trade-off between revealing $X$ perfectly and leaking a sensitive attribute $Y$. For instance, suppose a user wishes to share an image $X$ for some classification tasks. The image might carry information about attributes, say $Y$, that the user considers sensitive, even when such information is of limited use for the tasks, e.g., location or emotion. The PF framework seeks to extract a representation $T$ of $X$ from which the original image can be recovered with maximal accuracy while minimizing the privacy leakage with respect to $Y$. Using mutual information for both privacy leakage and informativeness, the privacy funnel can be formulated as
\[ \mathsf{PF}(r) := \inf I(Y;T), \tag{3} \]
where the infimum is taken over all randomized functions $f$, with $T = f(X)$, satisfying $I(X;T) \ge r$, and $r$ is the parameter specifying the level of informativeness. It is evident from formulations (2) and (3) that IB and PF are closely related. In fact, we shall see later that they correspond to the upper and lower boundaries of a two-dimensional compact convex set. This duality has led to the design of greedy algorithms [Makhdoumi2014FromTI, Sadeghi_PF] for estimating PF based on the agglomerative information bottleneck [Agglomerative_IB] algorithm. A similar formulation has recently been proposed in [PF_Adverserially] as a tool to train a neural network for learning a private representation of data $X$. Solving the IB and PF optimization problems analytically is challenging. However, recent machine learning applications, and deep learning algorithms in particular, have reignited the study of both IB and PF (see Related Work).
In this paper, we first give a cohesive overview of the existing results surrounding the IB and PF formulations. We then provide a comprehensive analysis of IB and PF from an information-theoretic perspective, as well as a survey of several formulations connected to IB and PF that have been introduced in the information theory and machine learning literature. Moreover, we overview connections with coding problems such as remote source coding [Noisy_SourceCoding_Dobrushin], testing against independence [Hypothesis_Testing_Ahslwede], and dependence dilution [Asoode_submitted]. Leveraging these connections, we prove a new cardinality bound for the bottleneck variable in IB, leading to a more tractable optimization problem for IB. We then consider a broad family of optimization problems by going beyond mutual information in formulations (2) and (3). We propose two candidates for this task: Arimoto's mutual information [Arimoto_Original_Paper] and $f$-information [Maxim_Strong_TIT]. By replacing $I(X;T)$ and/or $I(Y;T)$ with either of these measures, we generate a family of optimization problems that we refer to as bottleneck problems. These problems are shown to better capture the underlying trade-offs intended by IB and PF. More specifically, our main contributions are listed next.
•
Computing IB and PF is notoriously challenging when $X$ takes values in a set with infinite cardinality (e.g., $X$ is drawn from a continuous probability distribution). We consider three different scenarios to circumvent this difficulty. First, we assume that $Y$ is a Gaussian perturbation of $X$, i.e., $Y = X + N$ where $N$ is a noise variable sampled from a Gaussian distribution independent of $X$. Building upon recent advances in the entropy power inequality in [EPI_Courtade], we derive a sharp upper bound for $\mathsf{IB}(R)$. As a special case, we consider jointly Gaussian $(X,Y)$, for which the upper bound becomes tight. This then provides a significantly simpler proof than the original one given in [GaussianIB] for the fact that, in this special case, the optimal bottleneck variable is also Gaussian. In the second scenario, we assume that $X$ is a Gaussian perturbation of $Y$, i.e., $X = Y + N$. This corresponds to a practical setup where the observation $X$ is a noisy version of the feature $Y$. Relying on recent results on strong data processing inequalities [calmon2015strong], we obtain an upper bound on $\mathsf{IB}(R)$ which is tight for small values of $R$. In the last scenario, we compute a second-order approximation of PF under the assumption that the released variable is obtained by a Gaussian perturbation of $Y$. Interestingly, the rate of increase of PF for small values of $r$ is shown to be dictated by an asymmetric measure of dependence introduced by Rényi [Renyi-dependence-measure].
•
We extend Witsenhausen and Wyner's approach [Witsenhausen_Wyner] for analytically computing IB and PF. This technique converts solving the optimization problems in IB and PF into determining the convex and concave envelopes of a certain function, respectively. We apply this technique to binary $X$ and $Y$ and derive a closed-form expression for PF; we call this result Mr. Gerber's Lemma.
•
Relying on the connection between IB and noisy source coding [Noisy_SourceCoding_Dobrushin] (see [Bottleneck_Polyanskiy, Bottleneck_Shamai]), we show that the optimal bottleneck variable $T$ in the optimization problem (2) takes values in a set of cardinality at most $|\mathcal{X}|$. Compared to the best cardinality bound previously known (i.e., $|\mathcal{X}|+1$), this result reduces the dimension of the search space of the optimization problem (2). Moreover, we show that this bound does not hold for PF, indicating a fundamental difference between the optimization problems (2) and (3).
•
Following [strouse2017dib, Asoodeh_Allerton], we study the deterministic IB and PF (denoted by dIB and dPF) in which $T$ is assumed to be a deterministic function of $X$, i.e., $T = f(X)$ for some function $f$. By connecting dIB and dPF with entropy-constrained scalar quantization problems in information theory [Polyanskiy_Distilling], we obtain bounds on them explicitly in terms of the cardinality of the alphabet of $T$, sandwiching the deterministic variants between explicit upper and lower bounds.
•
By replacing $I(X;T)$ and/or $I(Y;T)$ in (2) and (3) with Arimoto's mutual information or $f$-information, we generate a family of bottleneck problems. We then argue that these new functionals better describe the trade-offs that were intended to be captured by IB and PF. The main reason is three-fold: First, as illustrated in Section II-C, the use of mutual information in IB and PF is mainly justified when independent samples of $(X,Y)$ are considered. However, Arimoto's mutual information allows for an operational interpretation even in the single-shot regime (i.e., for $n = 1$). Second, $I(Y;T)$ in IB and PF is meant to be a proxy for the efficiency of reconstructing $Y$ given observation $T$. However, this efficiency can be accurately formalized by the probability of correctly guessing $Y$ given $T$ (i.e., Bayes risk) or the minimum mean-squared error (MMSE) in estimating $Y$ given $T$. While $I(Y;T)$ only bounds these two measures, we show that they are precisely characterized by Arimoto's mutual information and $f$-information, respectively. Finally, when $P_{XY}$ is unknown, mutual information is notoriously difficult to estimate. Nevertheless, Arimoto's mutual information and $f$-information are easier to estimate: while the error in estimating mutual information decays more slowly in the number of samples $n$ [Shamir_IB], Diaz et al. [Diaz_Robustness] showed that this estimation error for Arimoto's mutual information and $f$-information scales as $O(1/\sqrt{n})$.
We also generalize our computation technique to analytically evaluate these bottleneck problems. As before, this technique converts computing bottleneck problems into determining convex and concave envelopes of certain functions. Focusing on binary $X$ and $Y$, we derive closed-form expressions for some of the bottleneck problems.
I-A Related Work
The IB formulation has been extensively applied in representation learning and clustering [IB_clustering, IB_DocumentClustering, IB_DoubleClustering, IB_Hidden, Zaidi_distributedIB, Zaidi2019distributed]. Clustering based on IB results in algorithms that cluster data points in terms of the similarity of their conditional distributions $P_{Y|X}(\cdot|x)$. When data points lie in a metric space, geometric clustering is usually preferred, where clustering is based upon the geometric (e.g., Euclidean) distance. Strouse and Schwab [strouse2017dib, strouse2019clustering] proposed the deterministic IB (denoted by dIB) by enforcing that $T$ is a deterministic mapping of $X$: dIB denotes the supremum of $I(Y;T)$ over all functions $T = f(X)$ satisfying the compression constraint. This optimization problem is closely related to the problem of scalar quantization in information theory: designing a function $f$ with a pre-determined output alphabet optimizing some objective function. This objective might be maximizing or minimizing the entropy $H(f(X))$ [Cicalese] or maximizing $I(Y; f(X))$ for a random variable $Y$ correlated with $X$ [Polyanskiy_Distilling, Lapidoth_Koch, LDPC1_quantization, LDPC2_quantization]. The latter problem provides lower bounds for dIB (and thus for IB). In particular, one can exploit [LDPC3_quantization, Theorem 1] to lower bound dIB in terms of IB, establishing a linear gap between IB and dIB irrespective of the distribution $P_{XY}$.
The connection between quantization and dIB further allows us to obtain multiplicative bounds. For instance, if $Y$ is binary and $X$ is obtained from $Y$ through additive noise independent of $Y$, then it is well known in the information theory literature that $I(Y; f(X))$ is within a constant multiplicative factor of $I(X;Y)$ for all non-constant $f$ (see, e.g., [Viterbi, Section 2.11]); this yields multiplicative bounds for dIB. We further explore this connection to provide multiplicative bounds on IB in Section II-E.
The study of IB has recently gained increasing traction in the context of deep learning. By taking $T$ to be the activity of the hidden layer(s), Tishby and Zaslavsky [tishby2015deep] (see also [IB_DP_openBox]) argued that neural network classifiers trained with cross-entropy loss and stochastic gradient descent (SGD) inherently aim at solving the IB optimization problem. In fact, it is claimed that the trajectory in the plane of $(I(X;T), I(Y;T))$ pairs (the so-called information plane) characterizes the learning dynamics of different layers in the network: shallow layers correspond to maximizing $I(Y;T)$ while deep layers' objective is minimizing $I(X;T)$. While the generality of this claim was refuted empirically in [On_IB_DL] and theoretically in [Inf_flow_IB_Polyiansky, Amjad_IB], it inspired significant follow-up studies. These include (i) modifying neural network training in order to solve the IB optimization problem [alemi2016deep, kolchinsky2017nonlinear, kolchinsky2018caveats, ReleventSparseCode, wickstrom2020information]; (ii) creating connections between IB and generalization error [Piantanida_roleIB], robustness [alemi2016deep], and detection of out-of-distribution data [Alemi_Uncertainity]; and (iii) using IB to understand specific characteristics of neural networks [Yu2018UnderstandingCN, Cheng2018EvaluatingCO, wickstrom2020information, Higgins2017betaVAELB].
In both IB and PF, mutual information poses some limitations. For instance, it may become infinite in deterministic neural networks [On_IB_DL, Inf_flow_IB_Polyiansky, Amjad_IB] and also may not lead to a proper privacy guarantee [Issa_Leakage_TIT]. As suggested in [wickstrom2020information, Sufficient_Statistics], one way to address this issue is to replace mutual information with other statistical measures. In the privacy literature, several measures with strong privacy guarantees have been proposed, including Rényi maximal correlation [Asoodeh_CWIT, Asoode_submitted, Fawaz_Makhdoumi], probability of correctly recovering [Asoodeh_TIT19, Asoode_ISIT17], minimum mean-squared estimation error (MMSE) [Asoode_MMSE_submitted, Calmon_principal_TIT], $\chi^2$-information [Hao_Privacy_estimation] (a special case of $f$-information to be described in Section III), Arimoto's and Sibson's mutual information [Shahab_PhD_thesis, Issa_Leakage_TIT] (to be discussed in Section III), maximal leakage [Liao_maximal_leakage], and local differential privacy [privacyaware]. All these measures ensure interpretable privacy guarantees. For instance, it is shown in [Asoode_MMSE_submitted, Calmon_principal_TIT] that if the $\chi^2$-information between $Y$ and $T$ is sufficiently small, then no function of $Y$ can be efficiently reconstructed given $T$, thus providing an interpretable privacy guarantee. (The original results in [Asoode_MMSE_submitted, Calmon_principal_TIT] involve Rényi maximal correlation instead of $\chi^2$-information. However, it can be shown that $\chi^2$-information is equal to the sum of squares of the singular values minus one (the largest one), while Rényi maximal correlation is equal to the second largest singular value [Witsenhausen:dependent]. Thus, $\chi^2$-information upper bounds Rényi maximal correlation.)
Another limitation of mutual information is related to its estimation difficulty. Estimating mutual information from $n$ samples incurs an estimation error that decays slowly in $n$ [Shamir_IB]. However, as shown by Diaz et al. [Diaz_Robustness], the estimation error for most of the above measures scales as $O(1/\sqrt{n})$. Furthermore, the recently popular variational estimators for mutual information, typically implemented via deep learning methods [MI_Estimator_Poole, MINE_Belghazi, Contrastive], present some fundamental limitations [Understanding_Variational]: the variance of the estimator might grow exponentially with the ground-truth mutual information, and the estimator might not satisfy basic properties of mutual information such as the data processing inequality or additivity. McAllester and Stratos [Stratos_MI_Estimator] showed that some of these limitations are inherent to a large family of mutual information estimators.
I-B Notation
We use capital letters, e.g., $X$, for random variables and calligraphic letters for their alphabets, e.g., $\mathcal{X}$. If $X$ is distributed according to the probability mass function (pmf) $P_X$, we write $X \sim P_X$. Given two random variables $X$ and $Y$, we write $P_{XY}$ and $P_{Y|X}$ for the joint distribution and the conditional distribution of $Y$ given $X$. We also interchangeably refer to $P_{Y|X}$ as a channel from $X$ to $Y$. We use $H(X)$ to denote both the entropy and the differential entropy of $X$, i.e., we have
\[ H(X) = -\sum_{x \in \mathcal{X}} P_X(x) \log P_X(x) \]
if $X$ is a discrete random variable taking values in $\mathcal{X}$ with pmf $P_X$, and
\[ H(X) = -\int f_X(x) \log f_X(x)\, \mathrm{d}x \]
if $X$ is an absolutely continuous random variable with probability density function (pdf) $f_X$. If $X$ is a binary random variable with $\Pr(X = 1) = p$, we write $X \sim \mathsf{Bernoulli}(p)$. In this case, its entropy is called the binary entropy function and denoted by $h_b(p)$. We use a superscript $\mathsf{G}$ to describe a standard Gaussian random variable, i.e., $N^{\mathsf{G}} \sim \mathcal{N}(0,1)$. Given two random variables $X$ and $Y$, their (Shannon) mutual information is denoted by $I(X;Y)$. We let $\Delta_{\mathcal{X}}$ denote the set of all probability distributions on the set $\mathcal{X}$. Given an arbitrary $Q_X \in \Delta_{\mathcal{X}}$ and a channel $P_{Y|X}$, we let $Q_X P_{Y|X}$ denote the resulting output distribution on $\mathcal{Y}$. For any $a \in [0,1]$, we use $\bar{a}$ to denote $1 - a$, and for any integer $n$, $[n] := \{1, \dots, n\}$.
Throughout the paper, we assume a pair of (discrete or continuous) random variables $(X, Y)$ is given with a fixed joint distribution $P_{XY}$, marginals $P_X$ and $P_Y$, and conditional distribution $P_{Y|X}$. We then use $Q_X$ to denote an arbitrary distribution on $\mathcal{X}$, with $Q_Y := Q_X P_{Y|X}$ the corresponding output distribution.
II Information Bottleneck and Privacy Funnel: Definitions and Functional Properties
In this section, we review the information bottleneck and its closely related functional, the privacy funnel. We then prove some analytical properties of these two functionals and develop a convex-analytic approach which enables us to compute closed-form expressions for both functionals in some simple cases.
To precisely quantify the trade-off between these two conflicting goals, the optimization problem (2) was proposed in [tishby2000information]. Since any randomized function can be equivalently characterized by a conditional distribution $P_{T|X}$, (2) can instead be expressed as
\[ \mathsf{IB}(R) := \sup_{P_{T|X}:\ I(X;T) \le R} I(Y;T), \tag{4} \]
where $R$ and $\mathsf{IB}(R)$ denote the level of desired compression and informativeness, respectively. We write $\mathsf{IB}(R)$ for short when the joint distribution $P_{XY}$ is clear from the context. Notice that if $R \ge H(X)$, then $\mathsf{IB}(R) = I(X;Y)$.
Now consider the setup where data $X$ is required to be disclosed while maintaining the privacy of a sensitive attribute, represented by $Y$. This goal was formulated by PF in (3). As before, replacing the randomized function with a conditional distribution $P_{T|X}$, we can equivalently express (3) as
\[ \mathsf{PF}(r) := \inf_{P_{T|X}:\ I(X;T) \ge r} I(Y;T), \tag{5} \]
where $\mathsf{PF}(r)$ and $r$ denote the level of privacy leakage and desired informativeness, respectively. The case $\mathsf{PF}(r) = 0$ is particularly interesting in practice and specifies perfect privacy, see e.g., [Calmon_fundamental-Limit, Rassouli_Perfect]. As before, we write $\mathsf{PF}(r)$ for short when $P_{XY}$ is clear from the context.



The following properties of IB and PF follow directly from their definitions. The proof of this result (and of all other results in this section) is given in Appendix LABEL:Appendix_ProofSecIB_PF.
Theorem 1.
For a given $P_{XY}$, the mappings $R \mapsto \mathsf{IB}(R)$ and $r \mapsto \mathsf{PF}(r)$ have the following properties:
•
.
•
for any and for .
•
for any and for any .
•
$\mathsf{IB}(\cdot)$ is continuous, strictly increasing, and concave on the range $[0, H(X)]$.
•
$\mathsf{PF}(\cdot)$ is continuous, strictly increasing, and convex on the range $[0, H(X)]$.
•
If $P_{XY}(x, y) > 0$ for all $x \in \mathcal{X}$ and $y \in \mathcal{Y}$, then both $\mathsf{IB}$ and $\mathsf{PF}$ are continuously differentiable over $(0, H(X))$.
•
$R \mapsto \frac{\mathsf{IB}(R)}{R}$ is non-increasing and $r \mapsto \frac{\mathsf{PF}(r)}{r}$ is non-decreasing.
•
We have
According to this theorem, we can always restrict $R$ in (4) and $r$ in (5) to the interval $[0, H(X)]$, as $\mathsf{IB}(R) = I(X;Y)$ for all $R \ge H(X)$.
Define the set $\mathcal{M} \subseteq \mathbb{R}^2$ as
\[ \mathcal{M} := \big\{ (I(X;T),\, I(Y;T)) : Y - X - T \big\}. \tag{6} \]
It can be directly verified that $\mathcal{M}$ is convex. According to this theorem, $\mathsf{IB}$ and $\mathsf{PF}$ correspond to the upper and lower boundaries of $\mathcal{M}$, respectively. The convexity of $\mathcal{M}$ then implies the concavity of $\mathsf{IB}$ and the convexity of $\mathsf{PF}$. Fig. 1 illustrates the set $\mathcal{M}$ for the simple case of binary $X$ and $Y$.
While both $\mathsf{IB}(0) = 0$ and $\mathsf{PF}(0) = 0$, their behavior in a neighborhood around zero might be completely different. As illustrated in Fig. 1, $\mathsf{IB}(R) > 0$ for all $R > 0$, whereas $\mathsf{PF}(r)$ may vanish for $r \in [0, r_0]$ for some $r_0 > 0$. When such an $r_0$ exists, we say perfect privacy occurs: there exists a variable $T$ satisfying $Y - X - T$ such that $I(X;T) > 0$ while $I(Y;T) = 0$, making $T$ a representation of $X$ having perfect privacy (i.e., no information leakage about $Y$). A necessary and sufficient condition for the existence of such a $T$ is given in [Asoode_submitted, Lemma 10] and [Calmon_fundamental-Limit, Theorem 3], described next.
Theorem 2 (Perfect privacy).
Let $P_{XY}$ be given and let $\mathcal{V}$ be the set of vectors $\{P_{X|Y}(\cdot|y) : y \in \mathcal{Y}\}$. Then there exists a variable $T$ such that $I(Y;T) = 0$ and $I(X;T) > 0$ if and only if the vectors in $\mathcal{V}$ are linearly dependent.
In light of this theorem, we obtain that perfect privacy occurs if $|\mathcal{Y}| > |\mathcal{X}|$. It also follows from the theorem that for binary $X$ and $Y$, perfect privacy cannot occur (see Fig. 1(a)).
Theorem 1 enables us to derive simple bounds for IB and PF. Specifically, the facts that $R \mapsto \mathsf{IB}(R)/R$ is non-increasing and $r \mapsto \mathsf{PF}(r)/r$ is non-decreasing immediately result in the following linear bounds.
Theorem 3 (Linear bounds).
For $R, r \in (0, H(X)]$, we have
\[ \mathsf{IB}(R) \ge \frac{R}{H(X)}\, I(X;Y) \qquad\text{and}\qquad \mathsf{PF}(r) \le \frac{r}{H(X)}\, I(X;Y). \tag{7} \]
In light of this theorem, if , then , implying for a deterministic function . Conversely, if then because for all forming the Markov relation , we have . On the other hand, we have if and only if there exists a variable satisfying and thus the following double Markov relations
It can be verified (see [csiszarbook, Problem 16.25]) that this double Markov condition is equivalent to the existence of a pair of functions under which $X$ decomposes appropriately. One special case of this setting, namely where one of the functions is the identity, has recently been studied in detail in [kolchinsky2018caveats] and will be reviewed in Section II-E. Theorem 3 also enables us to characterize the "worst" joint distributions with respect to IB and PF. As demonstrated in the following lemma, if the channel relating $X$ and $Y$ is an erasure channel, then the linear bounds in (7) are attained.
Lemma 1.
•
Let be such that , , and for some . Then
•
Let be such that , , and for some . Then
The bounds in Theorem 3 hold for all $R$ and $r$ in the interval $[0, H(X)]$. We can, however, improve them when $R$ and $r$ are sufficiently small. Let $\mathsf{IB}'(0)$ and $\mathsf{PF}'(0)$ denote the slopes of $\mathsf{IB}(R)$ and $\mathsf{PF}(r)$ at zero, i.e., $\mathsf{IB}'(0) := \lim_{R \downarrow 0} \frac{\mathsf{IB}(R)}{R}$ and $\mathsf{PF}'(0) := \lim_{r \downarrow 0} \frac{\mathsf{PF}(r)}{r}$.
Theorem 4.
Given $P_{XY}$, we have
This theorem provides the exact values of $\mathsf{IB}'(0)$ and $\mathsf{PF}'(0)$ and also simple bounds for them. Although the exact expressions for $\mathsf{IB}'(0)$ and $\mathsf{PF}'(0)$ are usually difficult to compute, a simple plug-in estimator was proposed in [Hypercontractivity_NIPS2017] for $\mathsf{IB}'(0)$. This estimator can be readily adapted to estimate $\mathsf{PF}'(0)$. Theorem 4 reveals a profound connection between IB and the strong data processing inequality (SDPI) [Ahlswede_Gacs]. More precisely, thanks to the pioneering work of Anantharam et al. [anantharam], it is known that the supremum of $\frac{\mathsf{IB}(R)}{R}$ over all $R > 0$ is equal to the supremum of $\frac{I(Y;T)}{I(X;T)}$ over all $P_{T|X}$ satisfying $Y - X - T$, and hence $\mathsf{IB}'(0)$ specifies the strengthening of the data processing inequality of mutual information. This connection may open a new avenue for theoretical results for IB, especially when $X$ or $Y$ are continuous random variables. In particular, the recent non-multiplicative SDPI results [Polyanskiy, calmon2015strong] seem insightful for this purpose.
In many practical cases, we might have $n$ i.i.d. samples of $(X, Y)$. We now study how IB behaves in $n$. Let $X^n = (X_1, \dots, X_n)$ and $Y^n = (Y_1, \dots, Y_n)$. Due to the i.i.d. assumption, we have $P_{X^n Y^n} = P_{XY}^{\otimes n}$. This can also be described by independently feeding $X_i$, $i \in [n]$, to the channel $P_{Y|X}$, producing $Y_i$. The following theorem, demonstrated first in [Witsenhausen_Wyner, Theorem 2.4], gives a formula for the information bottleneck of $(X^n, Y^n)$ in terms of that of $(X, Y)$.
Theorem 5 (Additivity).
We have
This theorem demonstrates that an optimal channel for $n$ i.i.d. samples is obtained by the $n$-fold product of an optimal channel for a single sample. This, however, may not hold in general for PF, that is, additivity can fail; see [Calmon_fundamental-Limit, Proposition 1] for an example.
II-A Gaussian $X$ and $Y$
In this section, we turn our attention to a special, yet important, case where $Y = X + N$, where $N \sim \mathcal{N}(0, \sigma^2)$ is independent of $X$. This setting subsumes the popular case of jointly Gaussian $(X, Y)$, whose information bottleneck functional was computed in [Bottleneck_Gaussian] for the vector case (i.e., $(X, Y)$ are jointly Gaussian random vectors).
Lemma 2.
Let $X^n$ be i.i.d. copies of $X$ and let $Y_i = X_i + N_i$, where $N_i$ are i.i.d. samples of $N \sim \mathcal{N}(0, \sigma^2)$ independent of $X^n$. Then, we have
It is worth noting that this result was concurrently proved in [Zaidi_GaussianCase]. The main technical tool in the proof of this lemma is a strong version of the entropy power inequality [EPI_Courtade, Theorem 2] which holds even if $X$, $N$, and $Y$ are random vectors (as opposed to scalars). Thus, one can readily generalize Lemma 2 to the vector case. Note that the upper bound established in this lemma holds without any assumptions on $P_X$. This upper bound provides a significantly simpler proof for the well-known fact that for jointly Gaussian $(X, Y)$, the optimal channel $P_{T|X}$ is Gaussian. This result was first proved in [GaussianIB] and used in [Bottleneck_Gaussian] to compute an expression of IB for the Gaussian case.
Corollary 1.
If $(X, Y)$ are jointly Gaussian with correlation coefficient $\rho$, then we have
\[ \mathsf{IB}(R) = \frac{1}{2}\log\frac{1}{1 - \rho^2\,(1 - 2^{-2R})}. \tag{8} \]
Moreover, the optimal channel is given by $T = X + W$ for a Gaussian $W$ independent of $X$, with the variance of $W$ chosen such that $I(X;T) = R$.
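As a quick numerical sanity check of Corollary 1, the sketch below evaluates the curve $R \mapsto \frac{1}{2}\log_2\frac{1}{1-\rho^2(1-2^{-2R})}$, a standard form of the scalar Gaussian IB curve in bits (the base and constant conventions here are our assumption), and verifies that it starts at zero, increases in $R$, and saturates at $I(X;Y) = -\tfrac12\log_2(1-\rho^2)$:

```python
import math

def gaussian_ib(R, rho):
    """Scalar Gaussian IB curve in bits (assumed closed form)."""
    return 0.5 * math.log2(1.0 / (1.0 - rho**2 * (1.0 - 2.0**(-2.0 * R))))

rho = 0.8
curve = [gaussian_ib(0.1 * k, rho) for k in range(101)]
assert all(a < b for a, b in zip(curve, curve[1:]))   # increasing in R
i_xy = -0.5 * math.log2(1.0 - rho**2)                 # saturation level I(X;Y)
print(curve[0], curve[-1], i_xy)
```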
In Lemma 2, we assumed that $Y$ is a Gaussian perturbation of $X$. However, in some practical scenarios, we might have $X$ as a Gaussian perturbation of $Y$. For instance, let $X$ represent an image and $Y$ a feature of the image, so that $X$ is a noisy observation of the feature. Then, the goal is to compress the image with a given compression rate while retaining maximal information about the feature. The following lemma, which is an immediate consequence of [calmon2015strong, Theorem 1], gives an upper bound for IB in this case.
Lemma 3.
Let $Y^n$ be i.i.d. copies of a random variable $Y$ with finite variance, and let $X_i$ be the result of passing $Y_i$, $i \in [n]$, through a Gaussian channel, i.e., $X_i = Y_i + N_i$, where $N_i \sim \mathcal{N}(0, \sigma^2)$ is independent of $Y_i$. Then, we have
(9) |
where
\[ Q(t) := \int_t^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\, \mathrm{d}u \tag{10} \]
is the Gaussian complementary CDF and $h_b(\cdot)$ is the binary entropy function. Moreover, we have
(11) |
Note that Lemma 3 holds for any arbitrary $Y$ (provided that its variance is finite) and hence (9) bounds the information bottleneck functional for a wide family of distributions $P_{XY}$. However, the bound is loose in general for large values of $R$. For instance, if $(X, Y)$ are jointly Gaussian, then the right-hand side of (9) does not reduce to (8). To show this, we numerically compute the upper bound (9) and compare it with the Gaussian information bottleneck (8) in Fig. 2.

The privacy funnel functional is much less studied, even for the simple case of jointly Gaussian $(X, Y)$. Solving the optimization in (5) over all $P_{T|X}$ without any assumptions is a difficult challenge. A natural assumption to make is that the conditional distribution of $T$ is Gaussian for each realization. This leads to the following Gaussian variant of PF
where
and is independent of . This formulation is tractable and can be computed in closed form for jointly Gaussian as described in the following example.
Example 1. Let $X$ and $Y$ be jointly Gaussian with correlation coefficient $\rho$. First note that since mutual information is invariant to scaling, we may assume without loss of generality that both $X$ and $Y$ are zero-mean and unit-variance, and hence we can write $X = \rho Y + \sqrt{1 - \rho^2}\, M$ where $M \sim \mathcal{N}(0, 1)$ is independent of $Y$. Consequently, we have
(12) |
and
(13) |
In order to ensure , we must have . Plugging this choice of into (13), we obtain
(14) |
This example indicates that for jointly Gaussian $(X, Y)$, we have $\mathsf{PF}(r) = 0$ if and only if $r = 0$ (thus perfect privacy does not occur) and the informativeness constraint is satisfied by a unique choice of the noise parameter. These two properties in fact hold for all continuous variables $X$ and $Y$ with finite second moments, as demonstrated in Lemma LABEL:Lemma:StrictCon_IYT in Appendix LABEL:Appendix_ProofSecIB_PF. We use these properties to derive a second-order approximation of $\mathsf{PF}(r)$ when $r$ is sufficiently small. For the following theorem, we use $\mathsf{var}(\cdot)$ to denote the variance of a random variable.
Theorem 6.
For any pair of continuous random variables $(X, Y)$ with finite second moments, we have, as $r \downarrow 0$,
where and
It is worth mentioning that this quantity was first defined by Rényi [Renyi-dependence-measure] as an asymmetric measure of correlation between $X$ and $Y$. In fact, it can be shown that it equals $\sup_{g} \rho(X; g(Y))$, where the supremum is taken over all measurable functions $g$ and $\rho$ denotes the correlation coefficient. As a simple illustration of Theorem 6, consider jointly Gaussian $X$ and $Y$ with correlation coefficient $\rho$, for which the Gaussian privacy funnel was computed in Example 1. In this case, the relevant quantities can be easily verified in closed form. In Fig. 3, we compare the approximation given in Theorem 6 with the exact expression for this particular case.

II-B Evaluation of and
The constrained optimization problems in the definitions of IB and PF are usually challenging to solve numerically due to the non-linearity of the constraints. In practice, however, both IB and PF are often approximated by their corresponding Lagrangian optimizations
\[ \mathcal{L}_{\mathsf{IB}}(\beta) := \sup_{P_{T|X}}\; I(Y;T) - \beta\, I(X;T), \tag{15} \]
and
\[ \mathcal{L}_{\mathsf{PF}}(\beta) := \inf_{P_{T|X}}\; I(Y;T) - \beta\, I(X;T), \tag{16} \]
where $\beta \ge 0$ is the Lagrange multiplier that controls the trade-off between compression and informativeness in (15) for IB and between privacy and informativeness in (16) for PF. Notice that for the computation of (15), we can assume, without loss of generality, that $\beta \le 1$, since otherwise the maximizer of (15) is trivial. It is worth noting that (15) and (16) in fact correspond to lines of slope $\beta$ supporting the set $\mathcal{M}$ from above and below, thereby providing a new representation of $\mathcal{M}$.
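The Lagrangian relaxation above is what the classic iterative algorithm of Tishby et al. targets for discrete alphabets. A minimal sketch follows; it uses the equivalent formulation $\min_{P_{T|X}} I(X;T) - \beta I(Y;T)$ common in the IB literature (so its trade-off parameter plays the role of the reciprocal of the multiplier above), and all function names are ours:

```python
import numpy as np

def iterative_ib(p_xy, n_t, beta, iters=500, seed=0):
    """Tishby et al.'s self-consistent iteration for the IB Lagrangian,
    written as min_{P_{T|X}} I(X;T) - beta * I(Y;T).  Returns P_{T|X}."""
    rng = np.random.default_rng(seed)
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1)                       # marginal of X
    p_y_x = p_xy / p_x[:, None]                  # rows: P_{Y|X=x}
    q = rng.random((p_xy.shape[0], n_t))
    q /= q.sum(axis=1, keepdims=True)            # random initial P_{T|X}
    for _ in range(iters):
        p_t = p_x @ q                            # marginal of T
        # P_{Y|T=t} via the Markov chain Y - X - T
        p_y_t = (q * p_x[:, None]).T @ p_y_x / (p_t[:, None] + 1e-300)
        # KL divergences D(P_{Y|X=x} || P_{Y|T=t}), an (|X| x |T|) array
        ent = np.einsum('xy,xy->x', p_y_x, np.log(p_y_x + 1e-300))
        kl = ent[:, None] - p_y_x @ np.log(p_y_t.T + 1e-300)
        q = p_t[None, :] * np.exp(-beta * kl)    # unnormalized update
        q /= q.sum(axis=1, keepdims=True)
    return q

p_xy = np.array([[0.45, 0.05],
                 [0.05, 0.45]])
q = iterative_ib(p_xy, n_t=2, beta=5.0)
print(np.round(q, 3))
```

Each pass alternates between the marginal of $T$, the decoder $P_{Y|T}$, and the exponential-family update for $P_{T|X}$; it converges to a stationary point of the (non-convex) Lagrangian, not necessarily a global optimum.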
Let $(X_q, Y_q)$ be a pair of random variables with $X_q \sim q$ for some $q \in \Delta_{\mathcal{X}}$ (the probability simplex on $\mathcal{X}$) and $Y_q$ the output of the channel $P_{Y|X}$ when the input is $X_q$ (i.e., $Y_q \sim q P_{Y|X}$). Define
\[ F_\beta(q) := H(Y_q) - \beta\, H(X_q). \]
This function is, in general, neither convex nor concave in $q$. For instance, $q \mapsto H(Y_q)$ is concave and $q \mapsto -\beta H(X_q)$ is convex in $q$. The lower convex envelope (resp. upper concave envelope) of $F_\beta$ is defined as the largest convex function smaller than $F_\beta$ (resp. the smallest concave function larger than $F_\beta$). Let $\mathsf{K}[F_\beta]$ and $\mathsf{C}[F_\beta]$ denote the lower convex and upper concave envelopes of $F_\beta$, respectively. If $F_\beta$ is convex at $q$, that is $F_\beta(q) = \mathsf{K}[F_\beta](q)$, then $F_{\beta'}$ remains convex at $q$ for all $\beta' \ge \beta$ because
where the last equality follows from the fact that is convex. Hence, at we have
Analogously, if $F_\beta$ is concave at $q$, that is $F_\beta(q) = \mathsf{C}[F_\beta](q)$, then $F_{\beta'}$ remains concave at $q$ for all $\beta' \le \beta$.
Notice that, according to (15) and (16), we can write
\[ \mathcal{L}_{\mathsf{IB}}(\beta) = F_\beta(P_X) - \mathsf{K}[F_\beta](P_X), \tag{17} \]
and
\[ \mathcal{L}_{\mathsf{PF}}(\beta) = F_\beta(P_X) - \mathsf{C}[F_\beta](P_X). \tag{18} \]
In light of the above arguments, we can write $\mathcal{L}_{\mathsf{IB}}(\beta) = 0$ for all $\beta \ge \beta_{\mathsf{IB}}$, where $\beta_{\mathsf{IB}}$ is the smallest $\beta$ such that $\mathsf{K}[F_\beta]$ touches $F_\beta$ at $P_X$. Similarly,
$\mathcal{L}_{\mathsf{PF}}(\beta) = 0$ for all $\beta \le \beta_{\mathsf{PF}}$, where $\beta_{\mathsf{PF}}$ is the largest $\beta$ such that $\mathsf{C}[F_\beta]$ touches $F_\beta$ at $P_X$. In the following proposition, we show that $\beta_{\mathsf{IB}}$ and $\beta_{\mathsf{PF}}$ are given by the values of $\mathsf{IB}'(0)$ and $\mathsf{PF}'(0)$, respectively, given in Theorem 4. Similar formulas were given in [Learability_2019].
Proposition 1.
We have,
and
Kim et al. [Hypercontractivity_NIPS2017] have recently proposed an efficient algorithm to estimate $\mathsf{IB}'(0)$ from samples of $P_{XY}$, involving a simple optimization problem. This algorithm can be readily adapted for estimating $\mathsf{PF}'(0)$. Proposition 1 implies that in optimizing the Lagrangians (17) and (18), we can restrict the Lagrange multiplier $\beta$ to a bounded interval, that is
(19) |
and
(20) |
Remark 1.
As demonstrated by Kolchinsky et al. [kolchinsky2018caveats], the boundary values of $\beta$ are required for the computation of IB. In fact, when $Y$ is a deterministic function of $X$, only the boundary values of $\beta$ are required to compute IB, and all other values of $\beta$ are vacuous. The same argument can also be used to justify the inclusion of the boundary values in computing PF. Note also that since $F_\beta$ becomes convex for sufficiently large $\beta$, computing $\mathsf{K}[F_\beta]$ becomes trivial for such values of $\beta$.
Remark 2.
Observe that the lower convex envelope of any function can be obtained by taking the Legendre-Fenchel transformation (a.k.a. convex conjugate) twice. Hence, one can use existing linear-time algorithms for approximating the Legendre-Fenchel transformation (e.g., [Legendre_transformation_alg, Legendre_transformation_alg2]) to approximate the lower convex envelope.
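Remark 2 can be made concrete on a grid: applying the discrete Legendre-Fenchel transform twice yields (an approximation of) the lower convex envelope of a sampled function. A minimal numerical sketch, with an illustrative non-convex test function and grid resolutions of our choosing:

```python
import numpy as np

def legendre(xs, fs, ss):
    """Discrete Legendre-Fenchel transform: f*(s) = max_x (s*x - f(x))."""
    return np.max(ss[:, None] * xs[None, :] - fs[None, :], axis=1)

def lower_convex_envelope(xs, fs, n_slopes=2001):
    """Biconjugate f** on the grid xs; applying the transform twice
    yields (an approximation of) the lower convex envelope."""
    d = np.diff(fs) / np.diff(xs)                # slopes attained by f
    ss = np.linspace(d.min() - 1.0, d.max() + 1.0, n_slopes)
    f_star = legendre(xs, fs, ss)
    return np.max(xs[:, None] * ss[None, :] - f_star[None, :], axis=1)

xs = np.linspace(0.0, 1.0, 201)
fs = np.sin(3.0 * np.pi * xs)                    # a non-convex test function
env = lower_convex_envelope(xs, fs)
print(env.min(), fs.min())
```

The envelope always lies below the sampled function and is convex by construction, being a pointwise maximum of affine functions; the upper concave envelope is obtained by negating, applying the same routine, and negating back.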
Once and are computed, we can derive and via standard results in optimization (see [Witsenhausen_Wyner, Section IV] for more details):
(21) |
and
(22) |
Following the convex analysis approach outlined by Witsenhausen and Wyner [Witsenhausen_Wyner], and can be directly computed from and by observing the following. Suppose for some , (resp. ) at is obtained by a convex combination of points , for some in , integer , and weights (with ). Then , and with properties and attains the minimum (resp. maximum) of . Hence, is a point on the upper (resp. lower) boundary of ; implying that for (resp. for ). If for some , at coincides with , then this corresponds to . The same holds for . Thus, all the information about the functional (resp. ) is contained in the subset of the domain of (resp. ) over which it differs from . We will revisit and generalize this approach later in Section III.
We can now instantiate this for the binary symmetric case. Suppose and are binary variables and is binary symmetric channel with crossover probability , denoted by and defined as
(23) |
for some . To describe the result in a compact fashion, we introduce the following notation: we let denote the binary entropy function, i.e., . Since this function is strictly increasing , its inverse exists and is denoted by . Also, for .
Lemma 4 (Mr. and Mrs. Gerber’s Lemma).
For for and for , we have
(24) |
and
(25) |
where , , and .
The result in (24) was proved by Wyner and Ziv [Gerber] and is widely known as Mrs. Gerber’s Lemma in information theory. Due to the similarity, we refer to (25) as Mr. Gerber’s Lemma. As described above, to prove (24) and (25) it suffices to derive the convex and concave envelopes of the mapping given by
(26) |
where is the output distribution of when the input distribution is for some . It can be verified that . This function is depicted in Fig. 4 for different values of .
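The quantities appearing in Mr. and Mrs. Gerber’s Lemma can be checked numerically. The sketch below (our own, assuming entropies measured in bits) implements the binary entropy function, its inverse on [0, 1/2], and the binary convolution, and evaluates the single-letter lower bound of (24):

```python
import math

def hb(p):
    """Binary entropy in bits; hb(0) = hb(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def hb_inv(t):
    """Inverse of hb restricted to [0, 1/2], computed by bisection."""
    lo, hi = 0.0, 0.5
    for _ in range(200):
        mid = (lo + hi) / 2
        if hb(mid) < t:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def bconv(a, d):
    """Binary convolution: a * d = a(1 - d) + d(1 - a)."""
    return a * (1 - d) + d * (1 - a)

# Mrs. Gerber's Lemma, single-letter form: if H(X|U) = t and Y is the output
# of a BSC(delta) with input X, then H(Y|U) >= hb(bconv(hb_inv(t), delta)).
t, delta = 0.5, 0.1
mgl_bound = hb(bconv(hb_inv(t), delta))
```

For t = 0.5 and delta = 0.1 the bound evaluates to roughly 0.70 bits, strictly above t, reflecting that the BSC can only increase conditional uncertainty.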
II-C Operational Meaning of and
In this section, we illustrate several information-theoretic settings which shed light on the operational interpretation of both and . The operational interpretation of has recently been extensively studied in information-theoretic settings in [Bottleneck_Polyanskiy, Bottleneck_Shamai]. In particular, it was shown that specifies the rate-distortion region of the noisy source coding problem [Noisy_SourceCoding_Dobrushin, Witsenhusen_Indirect] under logarithmic loss as the distortion measure and also the rate region of lossless source coding with side information at the decoder [Wyner_SourceCoding]. Here, we state the former setting (as it will be useful for our subsequent analysis of the cardinality bound) and also provide a new information-theoretic setting in which appears as the solution. Then, we describe another setting, the so-called dependence dilution, whose achievable rate region has an extreme point specified by . This in fact delineates an important difference between and : while describes the entire rate region of an information-theoretic setup, specifies only a corner point of a rate region. Other information-theoretic settings related to and include the CEO problem [Courtade_CEO] and source coding for the Gray-Wyner network [Cheuk_LI_GrayWyner].
II-C1 Noisy Source Coding
Suppose Alice has access only to a noisy version of a source of interest . She wishes to transmit a rate-constrained description from her observation (i.e., ) to Bob such that he can recover with small average distortion. More precisely, let be i.i.d. samples of . Alice encodes her observation through an encoder and sends to Bob. Upon receiving , Bob reconstructs a "soft" estimate of via a decoder where . That is, the reproduction sequence consists of probability measures on . For any source and reproduction sequences and , respectively, the distortion is defined as
where
(27) |
We say that a rate-distortion pair is achievable if there exists a pair of encoder and decoder such that
(28) |
The noisy rate-distortion function for a given , is defined as the minimum rate such that is an achievable rate-distortion pair. This problem arises naturally in many data-analytic applications; examples include feature selection in high-dimensional datasets, clustering, and matrix completion. This problem was first studied by Dobrushin and Tsybakov [Noisy_SourceCoding_Dobrushin], who showed that is analogous to the classical rate-distortion function
(29) |
It can be easily verified that and hence (after relabeling as )
(30) |
where , which is equal to defined in (4). For more details on the connection between noisy source coding and , the reader is referred to [Bottleneck_Shamai, Bottleneck_Polyanskiy, Courtade_CEO, Collaborative_IB]. Notice that one can study an essentially identical problem where the distortion constraint (28) is replaced by
This problem is addressed in [IB_operational] for discrete alphabets and and extended recently in [IB_General] for any general alphabets.
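To make the logarithmic-loss distortion concrete, the following single-letter sketch (with an illustrative joint pmf of our own choosing) verifies the standard fact that decoding with the posterior distribution makes the expected log-loss equal to the conditional entropy:

```python
import math

# Toy joint pmf of (Y, Z) on {0, 1}^2; the numbers are illustrative assumptions.
p_yz = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def posterior(y, z):
    """P(Y = y | Z = z) under p_yz."""
    p_z = sum(p_yz[(yy, z)] for yy in (0, 1))
    return p_yz[(y, z)] / p_z

# Logarithmic loss: reproducing Y by a soft estimate q(.) costs d(y, q) = log 1/q(y).
# With the posterior P(Y|Z) as the (single-letter) soft decoder, the expected
# distortion is exactly the conditional entropy H(Y|Z).
expected_log_loss = sum(v * math.log2(1.0 / posterior(y, z))
                        for (y, z), v in p_yz.items())
```

This identity is what ties the noisy source coding problem under log-loss to the mutual-information terms appearing in the bottleneck formulation.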
II-C2 Test Against Independence with Communication Constraint
As mentioned earlier, the connection between and noisy source coding, described above, was known and studied in [Bottleneck_Shamai, Bottleneck_Polyanskiy]. Here, we describe a new information-theoretic setting which provides yet another operational meaning for . Given i.i.d. samples from the joint distribution , we wish to test whether are independent of , that is, whether is a product distribution. This task is formulated by the following hypothesis test:
(31) |
for a given joint distribution with marginals and . Ahlswede and Csiszár [Hypothesis_Testing_Ahslwede] investigated this problem under a communication constraint: While observations (i.e., ) are available, the observations need to be compressed at rate , that is, instead of , only is present where satisfies
For the type I error probability not exceeding a fixed , Ahlswede and Csiszár [Hypothesis_Testing_Ahslwede] derived the smallest possible type II error probability, defined as
The following gives the asymptotic expression of for every . For the proof, refer to [Hypothesis_Testing_Ahslwede, Theorem 3].
Theorem 7 ([Hypothesis_Testing_Ahslwede]).
For every and , we have
In light of this theorem, specifies the exponential rate at which the type II error probability of the hypothesis test (31) decays as the number of samples increases.
II-C3 Dependence Dilution
Inspired by the problems of information amplification [Cover_State_Amplification] and state masking [Merhav_state_masking], Asoodeh et al. [Asoode_submitted] proposed the dependence dilution setup as follows. Consider a source sequence of i.i.d. copies of . Alice observes the source and wishes to encode it via the encoder
for some . The goal is to ensure that any user observing can construct a list, of fixed size, of sequences in that contains likely candidates of the actual sequence while revealing negligible information about a correlated source . To formulate this goal, consider the decoder
where denotes the power set of . A dependence dilution triple is said to be achievable if, for any , there exists a pair of encoder and decoder such that for sufficiently large
(32) |
having fixed size where and simultaneously
(33) |
Notice that without side information , the decoder can only construct a list of size which contains with probability close to one. However, after is observed and the list is formed, the decoder’s list size can be reduced to , thus reducing the uncertainty about by . This observation can be formalized to show (see [Cover_State_Amplification] for details) that the constraint (32) is equivalent to
(34) |
which lower bounds the amount of information carries about . Built on this equivalent formulation, Asoodeh et al. [Asoode_submitted, Corollary 15] derived a necessary condition for the achievable dependence dilution triple.
Theorem 8 ([Asoode_submitted]).
Any achievable dependence dilution triple satisfies
for some auxiliary random variable satisfying and taking values.
According to this theorem, specifies the best privacy performance of the dependence dilution setup for the maximum amplification rate . While this informs the operational interpretation of , Theorem 8 only provides an outer bound for the set of achievable dependence dilution triples . It is, however, not clear whether characterizes the rate region of an information-theoretic setup.
The fact that fully characterizes the rate region of a source coding setup has an important consequence: the cardinality of the auxiliary random variable in can be improved to instead of .
II-D Cardinality Bound
Recall that in the definition of in (4), no assumption was imposed on the auxiliary random variable . A straightforward application of the Carathéodory-Fenchel-Eggleston theorem (a strengthening of the original Carathéodory theorem when the underlying space is connected; see e.g., [Witsenhausen_Convexity, Section III] or [csiszarbook, Lemma 15.4]) reveals that is attained for taking values in a set with cardinality . Here, we improve this bound and show that it can be reduced to .
Theorem 9.
For any joint distribution and , information bottleneck is achieved by taking at most values.
The proof of this theorem hinges on the operational characterization of as the lower boundary of the rate-distortion region of the noisy source coding problem discussed in Section II-C. Specifically, we first show that the extreme points of this region are achieved by taking values. We then make use of a property of the noisy source coding problem (namely, time-sharing) to argue that all points of this region (including the boundary points) can be attained by such . It must be mentioned that this result was already claimed by Harremoës and Tishby in [harremoes2007information] without proof.
In many practical scenarios, the feature has a large alphabet. Hence, the bound , albeit optimal, can still render the information bottleneck function computationally intractable over large alphabets. However, the label usually has a significantly smaller alphabet. While it is in general impossible to have a cardinality bound for in terms of , one can consider approximating assuming takes values. The following result, recently proved by Hirche and Winter [Hirche_IB_cardinality], is in this spirit.
Theorem 10 ([Hirche_IB_cardinality]).
For any , we have
where and denotes the information bottleneck functional (4) with the additional constraint that .
Recall that, unlike , the graph of characterizes the rate region of a Shannon-theoretic coding problem (as illustrated in Section II-C), and hence any boundary point can be constructed via time-sharing of extreme points of the rate region. This lack of operational characterization of translates into a worse cardinality bound than that of . In fact, for the cardinality bound cannot be improved in general. To demonstrate this, we numerically solve the optimization in assuming that when both and are binary. As illustrated in Fig. 5, this optimization does not lead to a convex function, and hence, cannot be equal to .
II-E Deterministic Information Bottleneck
As mentioned earlier, formalizes an information-theoretic approach to clustering a high-dimensional feature into cluster labels that preserve as much information about the label as possible. The clustering label is assigned by the soft operator that solves the formulation (4) according to the rule: is likely assigned label if is small where . That is, cluster labels are assigned based on the similarity of conditional distributions. Since in many practical scenarios a hard clustering operator is preferred, Strouse and Schwab [strouse2017dib] suggested the following variant of , termed the deterministic information bottleneck
(35) |
where the maximization is taken over all deterministic functions whose range is a finite set . Similarly, one can define
(36) |
One way to ensure that for a deterministic function is to restrict the cardinality of the range of : if then is necessarily smaller than . Using this insight, we derive a lower bound for in the following lemma.
Lemma 5.
For any given , we have
and
Note that both and are smaller than and thus the multiplicative factors of in the lemma are smaller than one. In light of this lemma, we can obtain
and
In most practical setups, might be very large, making the above lower bound for vacuous. In the following lemma, we partially address this issue by deriving a bound independent of when is binary.
Lemma 6.
Let be a joint distribution of arbitrary and binary for some . Then, for any we have
where .
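For small alphabets, the deterministic bottleneck objective of (35) can be evaluated by exhaustive search over hard clusterings. The brute-force sketch below is our own illustration (the toy joint distribution in the test is an assumption, not data from the paper):

```python
import itertools
import math

def mi(p_ab):
    """Mutual information I(A; B) in bits from a joint pmf dict {(a, b): p}."""
    p_a, p_b = {}, {}
    for (a, b), v in p_ab.items():
        p_a[a] = p_a.get(a, 0.0) + v
        p_b[b] = p_b.get(b, 0.0) + v
    return sum(v * math.log2(v / (p_a[a] * p_b[b]))
               for (a, b), v in p_ab.items() if v > 0)

def det_ib(p_xy, x_alphabet, n_clusters):
    """Exhaustive search over deterministic maps f: X -> {0, ..., n_clusters-1},
    maximizing I(Y; f(X)) as in the deterministic bottleneck."""
    best = 0.0
    for labels in itertools.product(range(n_clusters), repeat=len(x_alphabet)):
        f = dict(zip(x_alphabet, labels))
        p_ty = {}  # joint pmf of (T, Y) induced by the hard clustering f
        for (x, y), v in p_xy.items():
            key = (f[x], y)
            p_ty[key] = p_ty.get(key, 0.0) + v
        best = max(best, mi(p_ty))
    return best
```

The search space grows as the number of clusters raised to the alphabet size, so this is only viable for toy problems; it is nevertheless useful for checking bounds such as those in Lemmas 5 and 6.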
III Family of Bottleneck Problems
In this section, we introduce a family of bottleneck problems by extending and to a large family of statistical measures. Similar to and , these bottleneck problems are defined in terms of boundaries of a two-dimensional convex set induced by a joint distribution . Recall that and are the upper and lower boundary of the set defined in (6) and expressed here again for convenience
(37) |
Since is given, and are fixed. Thus, in characterizing it is sufficient to consider only and . To generalize and , we must therefore generalize and .
Given a joint distribution and two non-negative real-valued functions and , we define
(38) |
and
(39) |
When and , we interchangeably write for and for .
These definitions provide natural generalizations of Shannon’s entropy and mutual information. Moreover, as we discuss later in Sections LABEL:Sec:Geussing and LABEL:Sec:Arimoto, they can also be specialized to represent a large family of popular information-theoretic and statistical measures. Examples include information and estimation-theoretic quantities such as Arimoto’s conditional entropy of order for , probability of correctly guessing for , maximal correlation for the binary case, and -information for given by -divergence. We are able to generate a family of bottleneck problems using different instantiations of and in place of mutual information in and . As we argue later, these problems better capture the essence of "informativeness" and "privacy", thus providing analytical and interpretable guarantees similar in spirit to and .
Computing these bottleneck problems in general boils down to the following optimization problems
(40) |
and