Bottleneck Problems:
Information and Estimation-Theoretic View^†^†thanks: This work was supported in part by NSF under grants CIF 1922971, 1815361, 1742836, 1900750, and CIF CAREER 1845852.

Shahab Asoodeh and Flavio P. Calmon S. Asoodeh and F. P. Calmon are with School of Engineering and Applied Science, Harvard University (e-mails: {shahab, flavio}@seas.harvard.edu). Part of the results in this paper was presented at the International Symposium on Information Theory 2018 [Hsu_Generalizing].

Abstract

Information bottleneck (IB) and privacy funnel (PF) are two closely related optimization problems which have found applications in machine learning, design of privacy algorithms, capacity problems (e.g., Mrs. Gerber’s Lemma), strong data processing inequalities, among others. In this work, we first investigate the functional properties of IB and PF through a unified theoretical framework. We then connect them to three information-theoretic coding problems, namely hypothesis testing against independence, noisy source coding and dependence dilution. Leveraging these connections, we prove a new cardinality bound for the auxiliary variable in IB, making its computation more tractable for discrete random variables.

In the second part, we introduce a general family of optimization problems, termed as bottleneck problems, by replacing mutual information in IB and PF with other notions of mutual information, namely $f$ -information and Arimoto’s mutual information. We then argue that, unlike IB and PF, these problems lead to easily interpretable guarantee in a variety of inference tasks with statistical constraints on accuracy and privacy. Although the underlying optimization problems are non-convex, we develop a technique to evaluate bottleneck problems in closed form by equivalently expressing them in terms of lower convex or upper concave envelope of certain functions. By applying this technique to binary case, we derive closed form expressions for several bottleneck problems.

I Introduction

Optimization formulations that involve information-theoretic quantities (e.g., mutual information) have been instrumental in a variety of learning problems found in machine learning. A notable example is the information bottleneck ( $\mathsf{IB}$ ) method [tishby2000information]. Suppose $Y$ is a target variable and $X$ is an observable correlated variable with joint distribution $P_{XY}$ . The goal of $\mathsf{IB}$ is to learn a "compact" summary (aka bottleneck) $T$ of $X$ that is maximally "informative" for inferring $Y$ . The bottleneck variable $T$ is assumed to be generated from $X$ by applying a random function $F$ to $X$ , i.e., $T=F(X)$ , in such a way that it is conditionally independent of $Y$ given $X$ , that we denote by

Y\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}X\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}T.

(1)

The $\mathsf{IB}$ quantifies this goal by measuring the “compactness” of $T$ using the mutual information $I(X;T)$ and, similarly, “informativeness” by $I(Y;T)$ . For a given level of compactness $R\geq 0$ , $\mathsf{IB}$ extracts the bottleneck variable $T$ that solves the constrained optimization problem

\mathsf{IB}(R)\coloneqq\sup\leavevmode\nobreak\ I(Y;T)\qquad\text{subject\leavevmode\nobreak\ to}\qquad I(X;T)\leq R,

(2)

where the supremum is taken over all randomized functions $T=F(X)$ satisfying $Y\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}X\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}T$ .

The optimization problem that underlies the information bottleneck has been studied in the information theory literature as early as the 1970’s — see [Gerber, Witsenhausen_Wyner, Gerbator_Ahlswede1977, Wyner_Gerber] — as a technique to prove impossibility results in information theory and also to study the common information between $X$ and $Y$ . Wyner and Ziv [Gerber] explicitly determined the value of $\mathsf{IB}(R)$ for the special case of binary $X$ and $Y$ — a result widely known as Mrs. Gerber’s Lemma [Gerber, networkinfotheory]. More than twenty years later, the information bottleneck function was studied by Tishby et al. [tishby2000information] and re-formulated in a data analytic context. Here, the random variable $X$ represents a high-dimensional observation with a corresponding low-dimensional feature $Y$ . $\mathsf{IB}$ aims at specifying a compressed description of image which is maximally informative about feature $Y$ . This framework led to several applications in clustering [slonim2000document, IB_clustering, Agglomerative_IB] and quantization [Quantization_IB, Quantization2_IB].

A closely-related framework to $\mathsf{IB}$ is the privacy funnel ( $\mathsf{PF}$ ) problem [Makhdoumi2014FromTI, Calmon_fundamental-Limit, Asoodeh_Allerton]. In the $\mathsf{PF}$ framework, a bottleneck variable $T$ is sought to maximally preserve "information" contained in $X$ while revealing as little about $Y$ as possible. This framework aims to capture the inherent trade-off between revealing $X$ perfectly and leaking a sensitive attribute $Y$ . For instance, suppose a user wishes to share an image $X$ for some classification tasks. The image might carry information about attributes, say $Y$ , that the user might consider as sensitive, even when such information is of limited use for the tasks, e.g, location, or emotion. The $\mathsf{PF}$ framework seeks to extract a representation of $X$ from which the original image can be recovered with maximal accuracy while minimizing the privacy leakage with respect to $Y$ . Using mutual information for both privacy leakage and informativeness, the privacy funnel can be formulated as

\mathsf{PF}(r)\coloneqq\inf\leavevmode\nobreak\ I(Y;T)\qquad\text{subject\leavevmode\nobreak\ to}\qquad I(X;T)\geq r,

(3)

where the infumum is taken over all randomized function $T=F(X)$ and $r$ is the parameter specifying the level of informativeness. It is evident from the formulations (2) and (3) that $\mathsf{IB}$ and $\mathsf{PF}$ are closely related. In fact, we shall see later that they correspond to the upper and lower boundaries of a two-dimensional compact convex set. This duality has led to design of greedy algorithms [Makhdoumi2014FromTI, Sadeghi_PF] for estimating $\mathsf{PF}$ based on the agglomerative information bottleneck [Agglomerative_IB] algorithm. A similar formulation has recently been proposed in [PF_Adverserially] as a tool to train a neural network for learning a private representation of data $X$ . Solving $\mathsf{IB}$ and $\mathsf{PF}$ optimization problems analytically is challenging. However, recent machine learning applications, and deep learning algorithms in particular, have reignited the study of both $\mathsf{IB}$ and $\mathsf{PF}$ (see Related Work).

In this paper, we first give a cohesive overview of the existing results surrounding the $\mathsf{IB}$ and the $\mathsf{PF}$ formulations. We then provide a comprehensive analysis of $\mathsf{IB}$ and $\mathsf{PF}$ from an information-theoretic perspective, as well as a survey of several formulations connected to the $\mathsf{IB}$ and $\mathsf{PF}$ that have been introduced in the information theory and machine learning literature. Moreover, we overview connections with coding problems such as remote source-coding [Noisy_SourceCoding_Dobrushin], testing against independence [Hypothesis_Testing_Ahslwede], and dependence dilution [Asoode_submitted]. Leveraging these connections, we prove a new cardinality bound for the bottleneck variable in $\mathsf{IB}$ , leading to more tractable optimization problem for $\mathsf{IB}$ . We then consider a broad family of optimization problems by going beyond mutual information in formulations (2) and (3). We propose two candidates for this task: Arimoto’s mutual information [Arimoto_Original_Paper] and $f$ -information [Maxim_Strong_TIT]. By replacing $I(Y;T)$ and/or $I(X;T)$ with either of these measures, we generate a family of optimization problems that we referred to as the bottleneck problems. These problems are shown to better capture the underlying trade-offs intended by $\mathsf{IB}$ and $\mathsf{PF}$ . More specifically, our main contributions are listed next.

•

Computing $\mathsf{IB}$ and $\mathsf{PF}$ are notoriously challenging when $X$ takes values in a set with infinite cardinality (e.g., $X$ is drawn from a continuous probability distribution). We consider three different scenarios to circumvent this difficulty. First, we assume that $X$ is a Gaussian perturbation of $Y$ , i.e., $X=Y+N^{\mathsf{G}}$ where $N^{\mathsf{G}}$ is a noise variable sampled from a Gaussian distribution independent of $Y$ . Building upon the recent advances in entropy power inequality in [EPI_Courtade], we derive a sharp upper bound for $\mathsf{IB}(R)$ . As a special case, we consider jointly Gaussian $(X,Y)$ for which the upper bound becomes tight. This then provides a significantly simpler proof for the fact that in this special case the optimal bottleneck variable $T$ is also Gaussian than the original proof given in [GaussianIB]. In the second scenario, we assume that $Y$ is a Gaussian perturbation of $X$ , i.e., $Y=X+N^{\mathsf{G}}$ . This corresponds to a practical setup where the feature $Y$ might be perfectly obtained from a noisy observation of $X$ . Relying on the recent results in strong data processing inequality [calmon2015strong], we obtain an upper bound on $\mathsf{IB}(R)$ which is tight for small values of $R$ . In the last scenario, we compute second-order approximation of $\mathsf{PF}(r)$ under the assumption that $T$ is obtained by Gaussian perturbation of $X$ , i.e., $T=X+N^{\mathsf{G}}$ . Interestingly, the rate of increase of $\mathsf{PF}(r)$ for small values of $r$ is shown to be dictated by an asymmetric measure of dependence introduced by Rényi [Renyi-dependence-measure].
•

We extend the Witsenhausen and Wyner’s approach [Witsenhausen_Wyner] for analytically computing $\mathsf{IB}$ and $\mathsf{PF}$ . This technique converts solving the optimization problems in $\mathsf{IB}$ and $\mathsf{PF}$ to determining the convex and concave envelopes of a certain function, respectively. We apply this technique to binary $X$ and $Y$ and derive a closed form expression for $\mathsf{PF}(r)$ – we call this result Mr. Gerber’s Lemma.
•

Relying on the connection between $\mathsf{IB}$ and noisy source coding [Noisy_SourceCoding_Dobrushin] (see [Bottleneck_Polyanskiy, Bottleneck_Shamai]), we show that the optimal bottleneck variable $T$ in optimization problem (2) takes values in a set ${\mathcal{T}}$ with cardinality $|{\mathcal{T}}|\leq|{\mathcal{X}}|$ . Compared to the best cardinality bound previously known (i.e., $|{\mathcal{T}}|\leq|{\mathcal{X}}|+1$ ), this result leads to a reduction in the search space’s dimension of the optimization problem (2) from $\mathbb{R}^{|{\mathcal{X}}|^{2}}$ to $\mathbb{R}^{|{\mathcal{X}}|(|{\mathcal{X}}|-1)}$ . Moreover, we show that this does not hold for $\mathsf{PF}$ , indicating a fundamental difference in optimizations problems (2) and (3).
•

Following [strouse2017dib, Asoodeh_Allerton], we study the deterministic $\mathsf{IB}$ and $\mathsf{PF}$ (denoted by $\mathsf{dIB}$ and $\mathsf{dPF}$ ) in which $T$ is assumed to be a deterministic function of $X$ , i.e., $T=f(X)$ for some function $f$ . By connecting $\mathsf{dIB}$ and $\mathsf{dPF}$ with entropy-constrained scalar quantization problems in information theory [Polyanskiy_Distilling], we obtain bounds on them explicitly in terms of $|{\mathcal{X}}|$ . Applying these bounds to $\mathsf{IB}$ , we obtain that $\frac{\mathsf{IB}(R)}{I(X;Y)}$ is bounded by one from above and by $\min\{\frac{R}{H(X)},\frac{e^{R}-1}{|{\mathcal{X}}|}\}$ from below.
•

By replacing $I(Y;T)$ and/or $I(X;T)$ in (2) and (3) with Arimoto’s mutual information or $f$ -information, we generate a family of bottleneck problems. We then argue that these new functionals better describe the trade-offs that were intended to be captured by $\mathsf{IB}$ and $\mathsf{PF}$ . The main reason is three-fold: First, as illustrated in Section II-C, mutual information in $\mathsf{IB}$ and $\mathsf{PF}$ are mainly justified when $n\gg 1$ independent samples $(X_{1},Y_{1}),\dots,(X_{n},Y_{n})$ of $P_{XY}$ are considered. However, Arimoto’s mutual information allows for operational interpretation even in the single-shot regime (i.e., for $n=1$ ). Second, $I(Y;T)$ in $\mathsf{IB}$ and $\mathsf{PF}$ is meant to be a proxy for the efficiency of reconstructing $Y$ given observation $T$ . However, this can be accurately formalized by probability of correctly guessing $Y$ given $T$ (i.e., Bayes risk) or minimum mean-square error (MMSE) in estimating $Y$ given $T$ . While $I(Y;T)$ bounds these two measures, we show that they are precisely characterized by Arimoto’s mutual information and $f$ -information, respectively. Finally, when $P_{XY}$ is unknown, mutual information is known to be notoriously difficult to estimate. Nevertheless, Arimoto’s mutual information and $f$ -information are easier to estimate: While mutual information can be estimated with estimation error that scales as $O(\log n/\sqrt{n})$ [Shamir_IB], Diaz et a. [Diaz_Robustness] showed that this estimation error for Arimoto’s mutual information and $f$ -information is $O(1/\sqrt{n})$ .

We also generalize our computation technique that enables us to analytically compute these bottleneck problems. Similar as before, this technique converts computing bottleneck problems to determining convex and concave envelopes of certain functions. Focusing on binary $X$ and $Y$ , we derive closed form expressions for some of the bottleneck problems.

I-A Related Work

The $\mathsf{IB}$ formulation has been extensively applied in representation learning and clustering [IB_clustering, IB_DocumentClustering, IB_DoubleClustering, IB_Hidden, Zaidi_distributedIB, Zaidi2019distributed]. Clustering based on $\mathsf{IB}$ results in algorithms that cluster data points in terms of the similarity of $P_{Y|X}$ . When data points lie in a metric space, usually geometric clustering is preferred where clustering is based upon the geometric (e.g., Euclidean) distance. Strouse and Schwab [strouse2017dib, strouse2019clustering] proposed the deterministic $\mathsf{IB}$ (denoted by $\mathsf{dIB}$ ) by enforcing that $P_{T|X}$ is a deterministic mapping: $\mathsf{dIB}(R)$ denotes the supremum of $I(Y;f(X))$ over all functions $f:{\mathcal{X}}\to{\mathcal{T}}$ satisfying $H(f(X))\leq R$ . This optimization problem is closely related to the problem of scalar quantization in information theory: designing a function $f:{\mathcal{X}}\to[M]\coloneqq\{1,\dots,M\}$ with a pre-determined output alphabet with $f$ optimizing some objective functions. This objective might be maximizing or minimizing $H(f(X))$ [Cicalese] or maximizing $I(Y;f(X))$ for a random variable $Y$ correlated with $X$ [Polyanskiy_Distilling, Lapidoth_Koch, LDPC1_quantization, LDPC2_quantization]. Since $H(f(X))\leq\log M$ for $f:{\mathcal{X}}\to[M]$ , the latter problem provides lower bounds for $\mathsf{dIB}$ (and thus for $\mathsf{IB}$ ). In particular, one can exploit [LDPC3_quantization, Theorem 1] to obtain $I(X;Y)-\mathsf{dIB}(R)\leq O(e^{-2R/|{\mathcal{Y}}|-1})$ provided that $\min\{|{\mathcal{X}}|,2^{R}\}>2|{\mathcal{Y}}|$ . This result establishes a linear gap between $\mathsf{dIB}$ and $I(X;Y)$ irrespective of $|{\mathcal{X}}|$ .

The connection between quantization and $\mathsf{dIB}$ further allows us to obtain multiplicative bounds. For instance, if $Y\sim{\mathsf{Bernoulli}}(\frac{1}{2})$ and $X=Y+N^{\mathsf{G}}$ , where $N^{\mathsf{G}}\sim{\mathcal{N}}(0,1)$ is independent of $Y$ , then it is well-known in information theory literature that $I(Y;f(X))\geq\frac{2}{\pi}I(X;Y)$ for all non-constant $f:{\mathcal{X}}\to\{0,1\}$ (see, e.g., [Viterbi, Section 2.11]), thus $\mathsf{dIB}(R)\geq\frac{2}{\pi}I(X;Y)$ for $R\leq 1$ . We further explore this connection to provide multiplicative bounds on $\mathsf{dIB}(R)$ in Section II-E.

The study of $\mathsf{IB}$ has recently gained increasing traction in the context of deep learning. By taking $T$ to be the activity of the hidden layer(s), Tishby and Zaslavsky [tishby2015deep] (see also [IB_DP_openBox]) argued that neural network classifiers trained with cross-entropy loss and stochastic gradient descent (SGD) inherently aims at solving the $\mathsf{IB}$ optimization problems. In fact, it is claimed that the graph of the function $R\mapsto\mathsf{IB}(R)$ (the so-called the information plane) characterizes the learning dynamic of different layers in the network: shallow layers correspond to maximizing $I(Y;T)$ while deep layers’ objective is minimizing $I(X;T)$ . While the generality of this claim was refuted empirically in [On_IB_DL] and theoretically in [Inf_flow_IB_Polyiansky, Amjad_IB], it inspired significant follow-up studies. These include (i) modifying neural network training in order to solve the $\mathsf{IB}$ optimization problem [alemi2016deep, kolchinsky2017nonlinear, kolchinsky2018caveats, ReleventSparseCode, wickstrom2020information]; (ii) creating connections between $\mathsf{IB}$ and generalization error [Piantanida_roleIB], robustness [alemi2016deep], and detection of out-of-distribution data [Alemi_Uncertainity]; and (iii) using $\mathsf{IB}$ to understand specific characteristic of neural networks [Yu2018UnderstandingCN, Cheng2018EvaluatingCO, wickstrom2020information, Higgins2017betaVAELB].

In both $\mathsf{IB}$ and $\mathsf{PF}$ , mutual information poses some limitations. For instance, it may become infinity in deterministic neural networks [On_IB_DL, Inf_flow_IB_Polyiansky, Amjad_IB] and also may not lead to proper privacy guarantee [Issa_Leakage_TIT]. As suggested in [wickstrom2020information, Sufficient_Statistics], one way to address this issue is to replace mutual information with other statistical measures. In the privacy literature, several measures with strong privacy guarantee have been proposed including Rényi maximal correlation [Asoodeh_CWIT, Asoode_submitted, Fawaz_Makhdoumi], probability of correctly recovering [Asoodeh_TIT19, Asoode_ISIT17], minimum mean-squared estimation error (MMSE) [Asoode_MMSE_submitted, Calmon_principal_TIT], $\chi^{2}$ -information [Hao_Privacy_estimation] (a special case of $f$ -information to be described in Section III), Arimoto’s and Sibson’s mutual information [Shahab_PhD_thesis, Issa_Leakage_TIT] – to be discussed in Section III, maximal leakage [Liao_maximal_leakage], and local differential privacy [privacyaware]. All these measures ensure interpretable privacy guarantees. For instance, it is shown in¹¹1The original results in [Asoode_MMSE_submitted, Calmon_principal_TIT] involve Rényi maximal correlation instead of $\chi^{2}$ -information. However, it can be shown that $\chi^{2}$ -information is equal to the sum of squares of the singular values of $f(Y)\mapsto{\mathbb{E}}[f(Y)|T]$ minus one (the largest one), while Rényi maximal correlation is equal to the second largest singular value [Witsenhausen:dependent]. Thus, $\chi^{2}$ -information upper bounds Rényi maximal correlation. [Asoode_MMSE_submitted, Calmon_principal_TIT] that if $\chi^{2}$ -information between $Y$ and $T$ is sufficiently small, then no functions of $Y$ can be efficiently reconstructed given $T$ ; thus providing an interpretable privacy guarantee.

Another limitation of mutual information is related to its estimation difficulty. It is known that mutual information can be estimated from $n$ samples with the estimation error that scales as $O(\log n/\sqrt{n})$ [Shamir_IB]. However, as shown by Diaz et al. [Diaz_Robustness], the estimation error for most of the above measures scales as $O(1/\sqrt{n})$ . Furthermore, the recently popular variational estimators for mutual information, typically implemented via deep learning methods [MI_Estimator_Poole, MINE_Belghazi, Contrastive], presents some fundamental limitations [Understanding_Variational]: the variance of the estimator might grow exponentially with the ground truth mutual information and also the estimator might not satisfy basic properties of mutual information such as data processing inequality or additivity. McAllester and Stratos [Stratos_MI_Estimator] showed that some of these limitations are inherent to a large family of mutual information estimators.

I-B Notation

We use capital letters, e.g., $X$ , for random variables and calligraphic letters for their alphabets, e.g., ${\mathcal{X}}$ . If $X$ is distributed according to probability mass function (pmf) $P_{X}$ , we write $X\sim P_{X}$ . Given two random variables $X$ and $Y$ , we write $P_{XY}$ and $P_{Y|X}$ as the joint distribution and the conditional distribution of $Y$ given $X$ . We also interchangeably refer to $P_{Y|X}$ as a channel from $X$ to $Y$ . We use $H(X)$ to denote both entropy and differential entropy of $X$ , i.e., we have

H(X)=-\sum_{x\in{\mathcal{X}}}P_{X}(x)\log P_{X}(x)

if $X$ is a discrete random variable taking values in ${\mathcal{X}}$ with probability mass function (pmf) $P_{X}$ and

H(X)=-\int\log f_{X}(x)\log f_{X}(x)\text{d}x,

where $X$ is an absolutely continuous random variable with probability density function (pdf) $f_{X}$ . If $X$ is a binary random variable with $P_{X}(1)=p$ , we write $X\sim{\mathsf{Bernoulli}}(p)$ . In this case, its entropy is called binary entropy function and denoted by $h_{\mathsf{b}}(p)\coloneqq-p\log p-(1-p)\log(1-p)$ . We use superscript ${\mathsf{G}}$ to describe a standard Gaussian random variable, i.e., $N^{\mathsf{G}}\sim{\mathcal{N}}(0,1)$ . Given two random variables $X$ and $Y$ , their (Shannon’s) mutual information is denoted by $I(X;Y)\coloneqq H(Y)-H(Y|X)$ . We let ${\mathcal{P}}({\mathcal{X}})$ denote the set of all probability distributions on the set ${\mathcal{X}}$ . Given an arbitrary $Q_{X}\in{\mathcal{P}}({\mathcal{X}})$ and a channel $P_{Y|X}$ , we let $Q_{X}P_{Y|X}$ denote the resulting output distribution on ${\mathcal{Y}}$ . For any $a\in[0,1]$ , we use $\bar{a}$ to denote $1-a$ and for any integer $k\in\mathbb{N}$ , $[k]\coloneqq\{1,2,\dots,k\}$ .

Throughout the paper, we assume a pair of (discrete or continuous) random variables $(X,Y)\sim P_{XY}$ are given with a fixed joint distribution $P_{XY}$ , marginals $P_{X}$ and $P_{Y}$ , and conditional distribution $P_{Y|X}$ . We then use $Q_{X}\in{\mathcal{P}}({\mathcal{X}})$ to denote an arbitrary distribution with $Q_{Y}=Q_{X}P_{Y|X}\in{\mathcal{P}}({\mathcal{Y}})$ .

II Information Bottleneck and Privacy Funnel: Definitions and Functional Properties

In this section, we review the information bottleneck and its closely related functional, the privacy funnel. We then prove some analytical properties of these two functionals and develop a convex analytic approach which enables us to compute closed-form expressions for both these two functionals in some simple cases.

To precisely quantify the trade-off between these two conflicting goals, the $\mathsf{IB}$ optimization problem (2) was proposed [tishby2000information]. Since any randomized function $T=F(X)$ can be equivalently characterized by a conditional distribution, (2) can be instead expressed as

\mathsf{IB}(P_{XY},R)\coloneqq\sup_{\begin{subarray}{c}P_{T|X}:Y\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}X\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}T\\ I(X;T)\leq R\end{subarray}}I(Y;T),\qquad\text{or}\qquad\widetilde{\mathsf{IB}}(P_{XY},\tilde{R})\coloneqq\inf_{\begin{subarray}{c}P_{T|X}:Y\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}X\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}T\\ I(Y;T)\geq\tilde{R}\end{subarray}}I(X;T).

(4)

where $R$ and $\tilde{R}$ denote the level of desired compression and informativeness, respectively. We use $\mathsf{IB}(R)$ and $\widetilde{\mathsf{IB}}(\tilde{R})$ to denote $\mathsf{IB}(P_{XY},R)$ and $\widetilde{\mathsf{IB}}(P_{XY},\tilde{R})$ , respectively, when the joint distribution is clear from the context. Notice that if $\mathsf{IB}(P_{XY},R)=\tilde{R}$ , then $\widetilde{\mathsf{IB}}(P_{XY},\tilde{R})=R$ .

Now consider the setup where data $X$ is required to be disclosed while maintaining the privacy of a sensitive attribute, represented by $Y$ . This goal was formulated by $\mathsf{PF}$ in (3). As before, replacing randomized function $T=F(X)$ with conditional distribution $P_{T|X}$ , we can equivalently express (3) as

\mathsf{PF}(P_{XY},r)\coloneqq\inf_{\begin{subarray}{c}P_{T|X}:Y\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}X\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}T\\ I(X;T)\geq r\end{subarray}}I(Y;T),\qquad\text{or}\qquad\widetilde{\mathsf{PF}}(P_{XY},\tilde{r})\coloneqq\sup_{\begin{subarray}{c}P_{T|X}:Y\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}X\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}T\\ I(Y;T)\leq\tilde{r}\end{subarray}}I(X;T),

(5)

where $\tilde{r}$ and $r$ denote the level of desired privacy and informativeness, respectively. The case $\tilde{r}=0$ is particularly interesting in practice and specifies perfect privacy, see e.g., [Calmon_fundamental-Limit, Rassouli_Perfect]. As before, we write $\widetilde{\mathsf{PF}}(\tilde{r})$ and $\mathsf{PF}(r)$ for $\widetilde{\mathsf{PF}}(P_{XY},\tilde{r})$ and $\mathsf{PF}(P_{XY},r)$ when $P_{XY}$ is clear from the context.

Refer to caption — (a) $|{\mathcal{X}}|=|{\mathcal{Y}}|=2$

The following properties of $\mathsf{IB}$ and $\mathsf{PF}$ follow directly from their definitions. The proof of this result (and any other results in this section) is given in Appendix LABEL:Appendix_ProofSecIB_PF.

Theorem 1.

For a given $P_{XY}$ , the mappings $\mathsf{IB}(R)$ and $\mathsf{PF}(r)$ have the following properties:

•

$\mathsf{IB}(0)=\mathsf{PF}(0)=0$ .
•

$\mathsf{IB}(R)=I(X;Y)$ for any $R\geq H(X)$ and $\mathsf{PF}(r)=I(X;Y)$ for $r\geq H(X)$ .
•

$0\leq\mathsf{IB}(R)\leq\min\{R,I(X;Y)\}$ for any $R\geq 0$ and $\mathsf{PF}(r)\geq\max\{r-H(X|Y),0\}$ for any $r\geq 0$ .
•

$R\mapsto\mathsf{IB}(R)$ is continuous, strictly increasing, and concave on the range $(0,I(X;Y))$ .
•

$r\mapsto\mathsf{PF}(r)$ is continuous, strictly increasing, and convex on the range $(0,I(X;Y))$ .
•

If $P_{Y|X}(y|x)>0$ for all $x\in{\mathcal{X}}$ and $y\in{\mathcal{Y}}$ , then both $R\mapsto\mathsf{IB}(R)$ and $r\mapsto\mathsf{PF}(r)$ are continuously differentiable over $(0,H(X))$ .
•

$R\mapsto\frac{\mathsf{IB}(R)}{R}$ is non-increasing and $r\mapsto\frac{\mathsf{PF}(r)}{r}$ is non-decreasing.

•

We have

\mathsf{IB}(R)\coloneqq\sup_{\begin{subarray}{c}P_{T|X}:Y\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}X\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}T\\ I(X;T)=R\end{subarray}}I(Y;T),\qquad\text{and}\qquad\mathsf{PF}(r)\coloneqq\inf_{\begin{subarray}{c}P_{T|X}:Y\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}X\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}T\\ I(X;T)=r\end{subarray}}I(Y;T).

According to this theorem, we can always restrict both $R$ and $r$ in (4) and (5), respectively, to $[0,H(X)]$ as $\mathsf{IB}(R)=\mathsf{PF}(r)=I(X;Y)$ for all $r,R\geq H(X)$ .

Define ${\mathcal{M}}={\mathcal{M}}(P_{XY})\subset\mathbb{R}^{2}$ as

{\mathcal{M}}\coloneqq\big{\{}(I(X;T),I(Y;T)):Y\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}X\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}T,(X,Y)\sim P_{XY}\big{\}}.

(6)

It can be directly verified that ${\mathcal{M}}$ is convex. According to this theorem, $R\mapsto\mathsf{IB}(R)$ and $r\mapsto\mathsf{PF}(r)$ correspond to the upper and lower boundary of ${\mathcal{M}}$ , respectively. The convexity of ${\mathcal{M}}$ then implies the concavity and convexity of $\mathsf{IB}$ and $\mathsf{PF}$ . Fig. 1 illustrates the set ${\mathcal{M}}$ for the simple case of binary $X$ and $Y$ .

While both $\mathsf{IB}(0)=0$ and $\mathsf{PF}(0)=0$ , their behavior in the neighborhood around zero might be completely different. As illustrated in Fig. 1, $\mathsf{IB}(R)>0$ for all $R>0$ , whereas $\mathsf{PF}(r)=0$ for $r\in[0,r_{0}]$ for some $r_{0}>0$ . When such $r_{0}>0$ exists, we say perfect privacy occurs: there exists a variable $T$ satisfying $Y\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}X\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}T$ such that $I(Y;T)=0$ while $I(X;T)>0$ ; making $T$ a representation of $X$ having perfect privacy (i.e., no information leakage about $Y$ ). A necessary and sufficient condition for the existence of such $T$ is given in [Asoode_submitted, Lemma 10] and [Calmon_fundamental-Limit, Theorem 3], described next.

Theorem 2 (Perfect privacy).

Let $(X,Y)\sim P_{XY}$ be given and ${\mathcal{A}}\subset[0,1]^{|{\mathcal{Y}}|}$ be the set of vectors $\{P_{Y|X}(\cdot|x),x\in{\mathcal{X}}\}$ . Then there exists $r_{0}>0$ such that $\mathsf{PF}(r)=0$ for $r\in[0,r_{0}]$ if and only if vectors in ${\mathcal{A}}$ are linearly independent.

In light of this theorem, we obtain that perfect privacy occurs if $|{\mathcal{X}}|>|{\mathcal{Y}}|$ . It also follows from the theorem that for binary $X$ , perfect privacy cannot occur (see Fig. 1(a)).

Theorem 1 enables us to derive a simple bounds for $\mathsf{IB}$ and $\mathsf{PF}$ . Specifically, the facts that $\frac{\mathsf{PF}(r)}{r}$ is non-decreasing and $\frac{\mathsf{IB}(R)}{R}$ is non-increasing immediately result in the the following linear bounds.

Theorem 3 (Linear lower bound).

For $r,R\in(0,H(X))$ , we have

\inf_{Q_{X}\in{\mathcal{P}}({\mathcal{X}})\atop Q_{X}\neq P_{X}}\frac{D_{\mathsf{KL}}(Q_{Y}\|P_{Y})}{D_{\mathsf{KL}}(Q_{X}\|P_{X})}\leq\frac{\mathsf{PF}(r)}{r}\leq\frac{I(X;Y)}{H(X)}\leq\frac{\mathsf{IB}(R)}{R}\leq\sup_{Q_{X}\in{\mathcal{P}}({\mathcal{X}})\atop Q_{X}\neq P_{X}}\frac{D_{\mathsf{KL}}(Q_{Y}\|P_{Y})}{D_{\mathsf{KL}}(Q_{X}\|P_{X})}\leq 1.

(7)

In light of this theorem, if $\mathsf{PF}(r)=r$ , then $I(X;Y)=H(X)$ , implying $X=g(Y)$ for a deterministic function $g$ . Conversely, if $X=g(Y)$ then $\mathsf{PF}(r)=r$ because for all $T$ forming the Markov relation $Y\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}g(Y)\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}T$ , we have $I(Y;T)=I(g(Y);T)$ . On the other hand, we have $\mathsf{IB}(R)=R$ if and only if there exists a variable $T^{*}$ satisfying $I(X;T^{*})=I(Y;T^{*})$ and thus the following double Markov relations

Y\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}X\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}T^{*},\qquad\text{and}\qquad X\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}Y\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}T^{*}.

It can be verified (see [csiszarbook, Problem 16.25]) that this double Markov condition is equivalent to the existence of a pair of functions $f$ and $g$ such that $f(X)=g(Y)$ and $(X,Y)\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}f(X)\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}T^{*}$ . One special case of this setting, namely where $g$ is an identity function, has been recently studied in details in [kolchinsky2018caveats] and will be reviewed in Section II-E. Theorem 3 also enables us to characterize the "worst" joint distribution $P_{XY}$ with respect to $\mathsf{IB}$ and $\mathsf{PF}$ . As demonstrated in the following lemma, if $P_{Y|X}$ is an erasure channel then $\frac{\mathsf{PF}(r)}{r}=\frac{\mathsf{IB}(R)}{R}=\frac{I(X;Y)}{H(X)}$ .

Lemma 1.

•

Let $P_{XY}$ be such that ${\mathcal{Y}}={\mathcal{X}}\cup\{\perp\}$ , $P_{Y|X}(x|x)=1-\delta$ , and $P_{Y|X}(\perp|x)=\delta$ for some $\delta>0$ . Then

$\frac{\mathsf{PF}(r)}{r}=\frac{\mathsf{IB}(R)}{R}=1-\delta.$
•

Let $P_{XY}$ be such that ${\mathcal{X}}={\mathcal{Y}}\cup\{\perp\}$ , $P_{X|Y}(y|y)=1-\delta$ , and $P_{X|Y}(\perp|y)=\delta$ for some $\delta>0$ . Then

$\mathsf{PF}(r)=\max\{r-H(X|Y),0\}.$

The bounds in Theorem 3 hold for all $r$ and $R$ in the interval $[0,H(X)]$ . We can, however, improve them when $r$ and $R$ are sufficiently small. Let $\mathsf{PF}^{\prime}(0)$ and $\mathsf{IB}^{\prime}(0)$ denote the slope of $\mathsf{PF}(\cdot)$ and $\mathsf{IB}(\cdot)$ at zero, i.e., $\mathsf{PF}^{\prime}(0)\coloneqq\lim_{r\to 0^{+}}\frac{\mathsf{PF}(r)}{r}$ and $\mathsf{IB}^{\prime}(0)\coloneqq\lim_{R\to 0^{+}}\frac{\mathsf{IB}(R)}{R}$ .

Theorem 4.

Given $(X,Y)\sim P_{XY}$ , we have

	$\displaystyle\inf_{Q_{X}\in{\mathcal{P}}({\mathcal{X}})\atop Q_{X}\neq P_{X}}\frac{D_{\mathsf{KL}}(Q_{Y}\\|P_{Y})}{D_{\mathsf{KL}}(Q_{X}\\|P_{X})}=\mathsf{PF}^{\prime}(0)$	$\displaystyle\leq\min_{\begin{subarray}{c}x\in{\mathcal{X}}:\\ P_{X}(x)>0\end{subarray}}\frac{D_{\mathsf{KL}}(P_{Y\|X}(\cdot\|x)\\|P_{Y}(\cdot))}{-\log P_{X}(x)}$
		$\displaystyle\leq\max_{\begin{subarray}{c}x\in{\mathcal{X}}:\\ P_{X}(x)>0\end{subarray}}\frac{D_{\mathsf{KL}}(P_{Y\|X}(\cdot\|x)\\|P_{Y}(\cdot))}{-\log P_{X}(x)}\leq\mathsf{IB}^{\prime}(0)=\sup_{Q_{X}\in{\mathcal{P}}({\mathcal{X}})\atop Q_{X}\neq P_{X}}\frac{D_{\mathsf{KL}}(Q_{Y}\\|P_{Y})}{D_{\mathsf{KL}}(Q_{X}\\|P_{X})}.$

This theorem provides the exact values of $\mathsf{PF}^{\prime}(0)$ and $\mathsf{IB}^{\prime}(0)$ and also simple bounds for them. Although the exact expressions for $\mathsf{PF}^{\prime}(0)$ and $\mathsf{IB}^{\prime}(0)$ are usually difficult to compute, a simple plug-in estimator is proposed in [Hypercontractivity_NIPS2017] for $\mathsf{IB}^{\prime}(0)$ . This estimator can be readily adapted to estimate $\mathsf{PF}^{\prime}(0)$ . Theorem 4 reveals a profound connection between $\mathsf{IB}$ and the strong data processing inequality (SDPI) [Ahlswede_Gacs]. More precisely, thanks to the pioneering work of Anantharam et al. [anantharam], it is known that the supremum of $\frac{D_{\mathsf{KL}}(Q_{Y}\|P_{Y})}{D_{\mathsf{KL}}(Q_{X}\|P_{X})}$ over all $Q_{X}\neq P_{X}$ is equal the supremum of $\frac{I(Y;T)}{I(X;T)}$ over all $P_{T|X}$ satisfying $Y\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}X\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}T$ and hence $\mathsf{IB}^{\prime}(0)$ specifies the strengthening of the data processing inequality of mutual information. This connection may open a new avenue for new theoretical results for $\mathsf{IB}$ , especially when $X$ or $Y$ are continuous random variables. In particular, the recent non-multiplicative SDPI results [Polyanskiy, calmon2015strong] seem insightful for this purpose.

In many practical cases, we might have $n$ i.i.d. samples $(X_{1},Y_{1}),\dots,(X_{n},Y_{n})$ of $(X,Y)\sim P_{XY}$ . We now study how $\mathsf{IB}$ behaves in $n$ . Let $X^{n}\coloneqq(X_{1},\dots,X_{n})$ and $Y^{n}\coloneqq(Y_{1},\dots,Y_{n})$ . Due to the i.i.d. assumption, we have $P_{X^{n}Y^{n}}(x^{n},y^{n})=\prod_{i=1}^{n}P_{XY}(x_{i},y_{i})$ . This can also be described by independently feeding $X_{i}$ , $i\in[n]$ , to channel $P_{Y|X}$ producing $Y_{i}$ . The following theorem, demonstrated first in [Witsenhausen_Wyner, Theorem 2.4], gives a formula for $\mathsf{IB}$ in terms of $n$ .

Theorem 5 (Additivity).

We have

\frac{1}{n}\mathsf{IB}(P_{X^{n}Y^{n}},nR)=\mathsf{IB}(P_{XY},R).

This theorem demonstrates that an optimal channel $P_{T^{n}|X^{n}}$ for i.i.d. samples $(X^{n},Y^{n})\sim P_{XY}$ is obtained by the Kronecker product of an optimal channel $P_{T|X}$ for $(X,Y)\sim P_{XY}$ . This, however, may not hold in general for $\mathsf{PF}$ , that is, we might have $\mathsf{PF}(P_{X^{n}Y^{n}},nr)<n\mathsf{PF}(P_{XY},r)$ , see [Calmon_fundamental-Limit, Proposition 1] for an example.

II-A Gaussian $\mathsf{IB}$ and $\mathsf{PF}$

In this section, we turn our attention to a special, yet important, case where $X=Y+\sigma N^{\mathsf{G}}$ , where $\sigma>0$ and $N^{\mathsf{G}}\sim{\mathcal{N}}(0,1)$ is independent of $Y$ . This setting subsumes the popular case of jointly Gaussian $(X,Y)$ whose information bottleneck functional was computed in [Bottleneck_Gaussian] for the vector case (i.e., $(X,Y)$ are jointly Gaussian random vectors).

Lemma 2.

Let $\{Y_{i}\}_{i=1}^{n}$ be $n$ i.i.d. copies of $Y\sim P_{Y}$ and $X_{i}=Y_{i}+\sigma N_{i}^{\mathsf{G}}$ where $\{N^{\mathsf{G}}_{i}\}$ are i.i.d samples of ${\mathcal{N}}(0,1)$ independent of $Y$ . Then, we have

\frac{1}{n}\mathsf{IB}(P_{X^{n}Y^{n}},nR)\leq H(X)-\frac{1}{2}\log\left[2\pi e\sigma^{2}+e^{2(H(Y)-R)}\right].

It is worth noting that this result was concurrently proved in [Zaidi_GaussianCase]. The main technical tool in the proof of this lemma is a strong version of the entropy power inequality [EPI_Courtade, Theorem 2] which holds even if $X_{i}$ , $Y_{i}$ , and $N_{i}$ are random vectors (as opposed to scalar). Thus, one can readily generalize Lemma 2 to the vector case. Note that the upper bound established in this lemma holds without any assumptions on $P_{T|X}$ . This upper bound provides a significantly simpler proof for the well-known fact that for the jointly Gaussian $(X,Y)$ , the optimal channel $P_{T|X}$ is Gaussian. This result was first proved in [GaussianIB] and used in [Bottleneck_Gaussian] to compute an expression of $\mathsf{IB}$ for the Gaussian case.

Corollary 1.

If $(X,Y)$ are jointly Gaussian with correlation coefficient $\rho$ , then we have

\mathsf{IB}(R)=\frac{1}{2}\log\frac{1}{1-\rho^{2}+\rho^{2}e^{-2R}}.

(8)

Moreover, the optimal channel $P_{T|X}$ is given by $P_{T|X}(\cdot|x)={\mathcal{N}}(0,\tilde{\sigma}^{2})$ for $\tilde{\sigma}^{2}=\sigma_{Y}^{2}\frac{e^{-2R}}{\rho^{2}(1-e^{-2R})}$ where $\sigma_{Y}^{2}$ is the variance of $Y$ .

In Lemma 2, we assumed that $X$ is a Gaussian perturbation of $Y$ . However, in some practical scenarios, we might have $Y$ as a Gaussian perturbation of $X$ . For instance, let $X$ represent an image and $Y$ be a feature of the image that can be perfectly obtained from a noisy observation of $X$ . Then, the goal is to compress the image with a given compression rate while retaining maximal information about the feature. The following lemma, which is an immediate consequence of [calmon2015strong, Theorem 1], gives an upper bound for $\mathsf{IB}$ in this case.

Lemma 3.

Let $X^{n}$ be $n$ i.i.d. copies of a random variable $X$ satisfying ${\mathbb{E}}[X^{2}]\leq 1$ and $Y_{i}$ be the result of passing $X_{i}$ , $i\in[n]$ , through a Gaussian channel $Y=X+\sigma N^{\mathsf{G}}$ , where $\sigma>0$ and $N^{\mathsf{G}}\sim{\mathcal{N}}(0,1)$ is independent of $X$ . Then, we have

\frac{1}{n}\mathsf{IB}(P_{X^{n}Y^{n}},nR)\leq R-\Psi(R,\sigma),

(9)

where

\Psi(R,\sigma)\coloneqq\max_{x\in[0,\frac{1}{2}]}2\mathsf{Q}\left(\sqrt{\frac{1}{x\sigma^{2}}}\right)\left(R-h_{\mathsf{b}}(x)-\frac{x}{2}\log\left(1+\frac{1}{x\sigma^{2}}\right)\right),

(10)

$\mathsf{Q}(t)\coloneqq\int_{t}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-\frac{t^{2}}{2}}\textnormal{d}t$ is the Gaussian complimentary CDF and $h_{\mathsf{b}}(a)\coloneqq-a\log(a)-(1-a)\log(1-a)$ for $a\in(0,1)$ is the binary entropy function. Moreover, we have

\frac{1}{n}\mathsf{IB}(P_{X^{n}Y^{n}},nR)\leq R-e^{-\frac{1}{R\sigma^{2}}\log\frac{1}{R}+\Theta\left(\log\frac{1}{R}\right)}.

(11)

Note that that Lemma 3 holds for any arbitrary $X$ (provided that ${\mathbb{E}}[X^{2}]\leq 1$ ) and hence (9) bounds information bottleneck functionals for a wide family of $P_{XY}$ . However, the bound is loose in general for large values of $R$ . For instance, if $(X,Y)$ are jointly Gaussian (implying $Y=X+\sigma N^{\mathsf{G}}$ for some $\sigma>0$ ), then the right-hand side of (9) does not reduce to (8). To show this, we numerically compute the upper bound (9) and compare it with the Gaussian information bottleneck (8) in Fig. 2.

The privacy funnel functional is much less studied even for the simple case of jointly Gaussian. Solving the optimization in $\mathsf{PF}$ over $P_{T|X}$ without any assumptions is a difficult challenge. A natural assumption to make is that $P_{T|X}(\cdot|x)$ is Gaussian for each $x\in{\mathcal{X}}$ . This leads to the following variant of $\mathsf{PF}$

\mathsf{PF}^{\mathsf{G}}(r)\coloneqq\inf_{\begin{subarray}{c}\sigma\geq 0,\\ I(X;T_{\sigma})\geq r\end{subarray}}I(Y;T_{\sigma}),

where

T_{\sigma}\coloneqq X+\sigma N^{\mathsf{G}},

and $N^{\mathsf{G}}\sim{\mathcal{N}}(0,1)$ is independent of $X$ . This formulation is tractable and can be computed in closed form for jointly Gaussian $(X,Y)$ as described in the following example.

Example 1. Let $X$ and $Y$ be jointly Gaussian with correlation coefficient $\rho$ . First note that since mutual information is invariant to scaling, we may assume without loss of generality that both $X$ and $Y$ are zero mean and unit variance and hence we can write $X=\rho Y+\sqrt{1-\rho^{2}}M^{\mathsf{G}}$ where $M^{\mathsf{G}}\sim{\mathcal{N}}(0,1)$ is independent of $Y$ . Consequently, we have

I(X;T_{\sigma})=\frac{1}{2}\log\left(1+\frac{1}{\sigma^{2}}\right),

(12)

and

I(Y;T_{\sigma})=\frac{1}{2}\log\left(1+\frac{\rho^{2}}{1-\rho^{2}+\sigma^{2}}\right).

(13)

In order to ensure $I(X;T_{\sigma})\geq r$ , we must have $\sigma\leq\left(e^{2r}-1\right)^{-\frac{1}{2}}$ . Plugging this choice of $\sigma$ into (13), we obtain

\mathsf{PF}^{\mathsf{G}}(r)=\frac{1}{2}\log\left(\frac{1}{1-\rho^{2}\left(1-e^{-2r}\right)}\right).

(14)

This example indicates that for jointly Gaussian $(X,Y)$ , we have $\mathsf{PF}^{\mathsf{G}}(r)=0$ if and only if $r=0$ (thus perfect privacy does not occur) and the constraint $I(X;T_{\sigma})=r$ is satisfied by a unique $\sigma$ . These two properties in fact hold for all continuous variables $X$ and $Y$ with finite second moments as demonstrated in Lemma LABEL:Lemma:StrictCon_IYT in Appendix LABEL:Appendix_ProofSecIB_PF. We use these properties to derive a second-order approximation of $\mathsf{PF}^{\mathsf{G}}(r)$ when $r$ is sufficiently small. For the following theorem, we use ${\mathsf{var}}(U)$ to denote the variance of the random variable $U$ and ${\mathsf{var}}(U|V)\coloneqq{\mathbb{E}}[(U-{\mathbb{E}}[U|V])^{2}|V]$ . We use $\sigma^{2}_{X}={\mathsf{var}}(X)$ for short.

Theorem 6.

For any pair of continuous random variables $(X,Y)$ with finite second moments, we have as $r\to 0$

\mathsf{PF}^{\mathsf{G}}(r)=\eta(X,Y)r+\Delta(X,Y)r^{2}+o(r^{2}),

where $\eta(X,Y)\coloneqq\frac{{\mathsf{var}}({\mathbb{E}}[X|Y])}{\sigma_{X}^{2}}$ and

\Delta(X,Y)\coloneqq\frac{2}{\sigma^{4}_{X}}\left[{\mathbb{E}}[{\mathsf{var}}^{2}(X|Y)]-\sigma_{X}^{2}{\mathbb{E}}[{\mathsf{var}}(X|Y)]\right].

It is worth mentioning that the quantity $\eta(X,Y)$ was first defined by Rényi [Renyi-dependence-measure] as an asymmetric measure of correlation between $X$ and $Y$ . In fact, it can be shown that $\eta(X,Y)=\sup_{f}\rho^{2}(X,f(Y)),$ where supremum is taken over all measurable functions $f$ and $\rho(\cdot,\cdot)$ denotes the correlation coefficient. As a simple illustration of Theorem 6, consider jointly Gaussian $X$ and $Y$ with correlation coefficient $\rho$ for which $\mathsf{PF}^{\mathsf{G}}$ was computed in Example 2. In this case, it can be easily verified that $\eta(X,Y)=\rho^{2}$ and $\Delta(X,Y)=-2\sigma_{X}^{2}\rho^{2}(1-\rho^{2})$ . Hence, for jointly Gaussian $(X,Y)$ with correlation coefficient $\rho$ and unit variance, we have $\mathsf{PF}^{\mathsf{G}}(r)=\rho^{2}r-2\rho^{2}(1-\rho^{2})r^{2}+o(r^{2})$ . In Fig. 3, we compare the approximation given in Theorem 6 for this particular case.

II-B Evaluation of $\mathsf{IB}$ and $\mathsf{PF}$

The constrained optimization problems in the definitions of $\mathsf{IB}$ and $\mathsf{PF}$ are usually challenging to solve numerically due to the non-linearity in the constraints. In practice, however, both $\mathsf{IB}$ and $\mathsf{PF}$ are often approximated by their corresponding Lagrangian optimizations

{\mathcal{L}}_{\mathsf{IB}}(\beta)\coloneqq\sup_{P_{T|X}}I(Y;T)-\beta I(X;T)=H(Y)-\beta H(X)-\inf_{P_{T|X}}\left[H(Y|T)-\beta H(X|T)\right],

(15)

and

{\mathcal{L}}_{\mathsf{PF}}(\beta)\coloneqq\inf_{P_{T|X}}I(Y;T)-\beta I(X;T)=H(Y)-\beta H(X)-\sup_{P_{T|X}}\left[H(Y|T)-\beta H(X|T)\right],

(16)

where $\beta\in\mathbb{R}_{+}$ is the Lagrangian multiplier that controls the tradeoff between compression and informativeness in for $\mathsf{IB}$ and the privacy and informativeness in $\mathsf{PF}$ . Notice that for the computation of ${\mathcal{L}}_{\mathsf{IB}}$ , we can assume, without loss of generality, that $\beta\in[0,1]$ since otherwise the maximizer of (15) is trivial. It is worth noting that ${\mathcal{L}}_{\mathsf{IB}}(\beta)$ and ${\mathcal{L}}_{\mathsf{PF}}(\beta)$ in fact correspond to lines of slope $\beta$ supporting ${\mathcal{M}}$ from above and below, thereby providing a new representation of ${\mathcal{M}}$ .

Let $(X^{\prime},Y^{\prime})$ be a pair of random variables with $X^{\prime}\sim Q_{X}$ for some $Q_{X}\in{\mathcal{P}}({\mathcal{X}})$ and $Y^{\prime}$ is the output of $P_{Y|X}$ when the input is $X^{\prime}$ (i.e., $Y^{\prime}\sim Q_{X}P_{Y|X}$ ). Define

F_{\beta}(Q_{X})\coloneqq H(Y^{\prime})-\beta H(X^{\prime}).

This function, in general, is neither convex nor concave in $Q_{X}$ . For instance, $F(0)$ is concave and $F(1)$ is convex in $P_{X}$ . The lower convex envelope (resp. upper concave envelope) of $F_{\beta}(Q_{X})$ is defined as the largest (resp. smallest) convex (resp. concave) smaller (larger) than $F_{\beta}(Q_{X})$ . Let $\mathcal{K}_{\mathsf{\cup}}[F_{\beta}(Q_{X})]$ and $\mathcal{K}_{\mathsf{\cap}}[F_{\beta}(Q_{X})]$ denote the lower convex and upper concave envelopes of $F_{\beta}(Q_{X})$ , respectively. If $F_{\beta}(Q_{X})$ is convex at $P_{X}$ , that is $\mathcal{K}_{\mathsf{\cup}}[F_{\beta}(Q_{X})]\big{|}_{P_{X}}=F_{\beta}(P_{X})$ , then $F_{\beta}(Q_{X})$ remains convex at $P_{X}$ for all $\beta^{\prime}\geq\beta$ because

	$\displaystyle\mathcal{K}_{\mathsf{\cup}}[F_{\beta^{\prime}}(Q_{X})]$	$\displaystyle=\mathcal{K}_{\mathsf{\cup}}[F_{\beta}(Q_{X})-(\beta^{\prime}-\beta)H(X^{\prime})]$
		$\displaystyle\geq\mathcal{K}_{\mathsf{\cup}}[F_{\beta}(Q_{X})]+\mathcal{K}_{\mathsf{\cup}}[-(\beta^{\prime}-\beta)H(X^{\prime})]$
		$\displaystyle=\mathcal{K}_{\mathsf{\cup}}[F_{\beta}(Q_{X})]-(\beta^{\prime}-\beta)H(X^{\prime}),$

where the last equality follows from the fact that $-(\beta^{\prime}-\beta)H(X)$ is convex. Hence, at $P_{X}$ we have

\mathcal{K}_{\mathsf{\cup}}[F_{\beta^{\prime}}(Q_{X})]\big{|}_{P_{X}}\geq\mathcal{K}_{\mathsf{\cup}}[F_{\beta}(Q_{X})]\big{|}_{P_{X}}-(\beta^{\prime}-\beta)H(X)=F_{\beta}(P_{X})-(\beta^{\prime}-\beta)H(X)=F_{\beta^{\prime}}(P_{X}).

Analogously, if $F_{\beta}(Q_{X})$ is concave at $P_{X}$ , that is $\mathcal{K}_{\mathsf{\cap}}[F_{\beta}(Q_{X})]\big{|}_{P_{X}}=F_{\beta}(P_{X})$ , then $F_{\beta}(Q_{X})$ remains concave at $P_{X}$ for all $\beta^{\prime}\leq\beta$ .

Notice that, according to (15) and (16), we can write

{\mathcal{L}}_{\mathsf{IB}}(\beta)=H(Y)-\beta H(X)-\mathcal{K}_{\mathsf{\cup}}[F_{\beta}(Q_{X})]\big{|}_{P_{X}},

(17)

and

{\mathcal{L}}_{\mathsf{PF}}(\beta)=H(Y)-\beta H(X)-\mathcal{K}_{\mathsf{\cap}}[F_{\beta}(Q_{X})]\big{|}_{P_{X}}.

(18)

In light of the above arguments, we can write

{\mathcal{L}}_{\mathsf{IB}}(\beta)=0,

for all $\beta>\beta_{\mathsf{IB}}$ where $\beta_{\mathsf{IB}}$ is the smallest $\beta$ such that $F_{\beta}(P_{X})$ touches $\mathcal{K}_{\mathsf{\cup}}[F_{\beta}(Q_{X})]$ . Similarly,

{\mathcal{L}}_{\mathsf{PF}}(\beta)=0,

for all $\beta<\beta_{\mathsf{PF}}$ where $\beta_{\mathsf{PF}}$ is the largest $\beta$ such that $F_{\beta}(P_{X})$ touches $\mathcal{K}_{\mathsf{\cap}}[F_{\beta}(Q_{X})]$ . In the following theorem, we show that $\beta_{\mathsf{IB}}$ and $\beta_{\mathsf{PF}}$ are given by the values of $\mathsf{IB}^{\prime}(0)$ and $\mathsf{PF}^{\prime}(0)$ , respectively, given in Theorem 4. A similar formulae $\beta_{\mathsf{IB}}$ and $\beta_{\mathsf{PF}}$ were given in [Learability_2019].

Proposition 1.

We have,

\beta_{\mathsf{IB}}=\sup_{Q_{X}\neq P_{X}}\frac{D_{\mathsf{KL}}(Q_{Y}\|P_{Y})}{D_{\mathsf{KL}}(Q_{X}\|P_{X})},

and

\beta_{\mathsf{PF}}=\inf_{Q_{X}\neq P_{X}}\frac{D_{\mathsf{KL}}(Q_{Y}\|P_{Y})}{D_{\mathsf{KL}}(Q_{X}\|P_{X})}.

Kim et al. [Hypercontractivity_NIPS2017] have recently proposed an efficient algorithm to estimate $\beta_{\mathsf{IB}}$ from samples of $P_{XY}$ involving a simple optimization problem. This algorithm can be readily adapted for estimating $\beta_{\mathsf{PF}}$ . Proposition 1 implies that in optimizing the Lagrangians (17) and (18), we can restrict the Lagrange multiplier $\beta$ , that is

{\mathcal{L}}_{\mathsf{IB}}(\beta)=H(Y)-\beta H(X)-\mathcal{K}_{\mathsf{\cup}}[F_{\beta}(Q_{X})]\big{|}_{P_{X}},\qquad\text{for}\qquad\beta\in[0,\beta_{\mathsf{IB}}],

(19)

and

{\mathcal{L}}_{\mathsf{PF}}(\beta)=H(Y)-\beta H(X)-\mathcal{K}_{\mathsf{\cap}}[F_{\beta}(Q_{X})]\big{|}_{P_{X}},\qquad\text{for}\qquad\beta\in[\beta_{\mathsf{PF}},\infty).

(20)

Remark 1.

As demonstrated by Kolchinsky et al. [kolchinsky2018caveats], the boundary points $0$ and $\beta_{\mathsf{IB}}$ are required for the computation of ${\mathcal{L}}_{\mathsf{IB}}(\beta)$ . In fact, when $Y$ is a deterministic function of $X$ , then only $\beta=0$ and $\beta=\beta_{\mathsf{IB}}$ are required to compute the $\mathsf{IB}$ and other values of $\beta$ are vacuous. The same argument can also be used to justify the inclusion of $\beta_{\mathsf{PF}}$ in computing ${\mathcal{L}}_{\mathsf{PF}}(\beta)$ . Note also that since $F_{\beta}(Q_{X})$ becomes convex for $\beta>\beta_{\mathsf{IB}}$ , computing $\mathcal{K}_{\mathsf{\cap}}[F_{\beta}(Q_{X})]$ becomes trivial for such values of $\beta$ .

Remark 2.

Observe that the lower convex envelope of any function $f$ can be obtained by taking Legendre-Fenchel transformation (aka. convex conjugate) twice. Hence, one can use the existing linear-time algorithms for approximating Legendre-Fenchel transformation (e.g., [Legendre_transformation_alg, Legendre_transformation_alg2]) for approximating $\mathcal{K}_{\mathsf{\cup}}[F_{\beta}(Q_{X})]$ .

Once ${\mathcal{L}}_{\mathsf{IB}}(\beta)$ and ${\mathcal{L}}_{\mathsf{PF}}(\beta)$ are computed, we can derive $\mathsf{IB}$ and $\mathsf{PF}$ via standard results in optimization (see [Witsenhausen_Wyner, Section IV] for more details):

\mathsf{IB}(R)=\inf_{\beta\in[0,\beta_{\mathsf{IB}}]}\beta R+{\mathcal{L}}_{\mathsf{IB}}(\beta),

(21)

and

\mathsf{PF}(r)=\sup_{\beta\in[\beta_{\mathsf{PF}},\infty]}\beta r+{\mathcal{L}}_{\mathsf{PF}}(\beta).

(22)

Following the convex analysis approach outlined by Witsenhausen and Wyner [Witsenhausen_Wyner], $\mathsf{IB}$ and $\mathsf{PF}$ can be directly computed from ${\mathcal{L}}_{\mathsf{IB}}(\beta)$ and ${\mathcal{L}}_{\mathsf{PF}}(\beta)$ by observing the following. Suppose for some $\beta$ , $\mathcal{K}_{\mathsf{\cup}}[F_{\beta}(Q_{X})]$ (resp. $\mathcal{K}_{\mathsf{\cap}}[F_{\beta}(Q_{X})]$ ) at $P_{X}$ is obtained by a convex combination of points $F_{\beta}(Q^{i})$ , $i\in[k]$ for some $Q^{1},\dots,Q^{k}$ in ${\mathcal{P}}({\mathcal{X}})$ , integer $k\geq 2$ , and weights $\lambda_{i}\geq 0$ (with $\sum_{i}\lambda_{i}=1$ ). Then $\sum_{i}\lambda_{i}Q^{i}=P_{X}$ , and $T^{*}$ with properties $P_{T^{*}}(i)=\lambda_{i}$ and $P_{X|T^{*}=i}=Q^{i}$ attains the minimum (resp. maximum) of $H(Y|T)-\beta H(X|T)$ . Hence, $(I(X;T^{*}),I(Y;T^{*}))$ is a point on the upper (resp. lower) boundary of ${\mathcal{M}}$ ; implying that $\mathsf{IB}(R)=I(Y;T^{*})$ for $R=I(X;T^{*})$ (resp. $\mathsf{PF}(r)=I(Y;T^{*})$ for $r=I(X;T^{*})$ ). If for some $\beta$ , $\mathcal{K}_{\mathsf{\cup}}[F_{\beta}(Q_{X})]$ at $P_{X}$ coincides with $F_{\beta}[P_{X}]$ , then this corresponds to ${\mathcal{L}}_{\mathsf{IB}}(\beta)=0$ . The same holds for $\mathcal{K}_{\mathsf{\cup}}[F_{\beta}(Q_{X})]$ . Thus, all the information about the functional $\mathsf{IB}$ (resp. $\mathsf{PF}$ ) is contained in the subset of the domain of $\mathcal{K}_{\mathsf{\cup}}[F_{\beta}(Q_{X})]$ (resp. $\mathcal{K}_{\mathsf{\cap}}[F_{\beta}(Q_{X})]$ ) over which it differs from $F_{\beta}(Q_{X})$ . We will revisit and generalize this approach later in Section III.

We can now instantiate this for the binary symmetric case. Suppose $X$ and $Y$ are binary variables and $P_{Y|X}$ is binary symmetric channel with crossover probability $\delta$ , denoted by $\mathsf{BSC}(\delta)$ and defined as

\mathsf{BSC}(\delta)=\begin{bmatrix}1-\delta&\delta\\ \delta&1-\delta\end{bmatrix},

(23)

for some $\delta\geq 0$ . To describe the result in a compact fashion, we introduce the following notation: we let $h_{\mathsf{b}}:[0,1]\to[0,1]$ denote the binary entropy function, i.e., $h_{\mathsf{b}}(p)=-p\log p-(1-p)\log(1-p)$ . Since this function is strictly increasing $[0,\frac{1}{2}]$ , its inverse exists and is denoted by $h^{-1}_{\mathsf{b}}:[0,1]\to[0,\frac{1}{2}]$ . Also, $a*b\coloneqq a(1-b)+b(1-a)$ for $a,b\in[0,1]$ .

Lemma 4 (Mr. and Mrs. Gerber’s Lemma).

For $X\sim{\mathsf{Bernoulli}}(p)$ for $p\leq\frac{1}{2}$ and $P_{Y|X}=\mathsf{BSC}(\delta)$ for $\delta\geq 0$ , we have

\mathsf{IB}(R)=h_{\mathsf{b}}(p*\delta)-h_{\mathsf{b}}\left(\delta*h_{\mathsf{b}}^{-1}\big{(}h_{\mathsf{b}}(p)-R\big{)}\right),

(24)

and

\mathsf{PF}(r)=h_{\mathsf{b}}(p*\delta)-\alpha h_{\mathsf{b}}\left(\delta*\frac{p}{z}\right)-\bar{\alpha}h_{\mathsf{b}}\left(\delta\right),

(25)

where $r=h_{\mathsf{b}}(p)-\alpha h_{\mathsf{b}}\left(\frac{p}{z}\right)$ , $z=\max\left(\alpha,2p\right)$ , and $\alpha\in[0,1]$ .

The result in (24) was proved by Wyner and Ziv [Gerber] and is widely known as Mrs. Gerber’s Lemma in information theory. Due to the similarity, we refer to (25) as Mr. Gerber’s Lemma. As described above, to prove (24) and (25) it suffices to derive the convex and concave envelopes of the mapping $F_{\beta}:[0,1]\to\mathbb{R}$ given by

F_{\beta}(q)\coloneqq F_{\beta}(Q_{X})=h_{\mathsf{b}}(q*\delta)-\beta h_{\mathsf{b}}(q),

(26)

where $q*\delta\coloneqq q\bar{\delta}+\delta\bar{q}$ is the output distribution of $\mathsf{BSC}(\delta)$ when the input distribution is ${\mathsf{Bernoulli}}(q)$ for some $q\in(0,1)$ . It can be verified that $\beta_{\mathsf{IB}}\leq(1-2\delta)^{2}$ . This function is depicted in Fig. 4 depending of the values of $\beta\leq(1-2\delta)^{2}$ .

II-C Operational Meaning of $\mathsf{IB}$ and $\mathsf{PF}$

In this section, we illustrate several information-theoretic settings which shed light on the operational interpretation of both $\mathsf{IB}$ and $\mathsf{PF}$ . The operational interpretation of $\mathsf{IB}$ has recently been extensively studied in information-theoretic settings in [Bottleneck_Polyanskiy, Bottleneck_Shamai]. In particular, it was shown that $\mathsf{IB}$ specifies the rate-distortion region of noisy source coding problem [Noisy_SourceCoding_Dobrushin, Witsenhusen_Indirect] under the logarithmic loss as the distortion measure and also the rate region of the lossless source coding with side information at the decoder [Wyner_SourceCoding]. Here, we state the former setting (as it will be useful for our subsequent analysis of cardinality bound) and also provide a new information-theoretic setting in which $\mathsf{IB}$ appears as the solution. Then, we describe another setting, the so-called dependence dilution, whose achievable rate region has an extreme point specified by $\mathsf{PF}$ . This in fact delineate an important difference between $\mathsf{IB}$ and $\mathsf{PF}$ : while $\mathsf{IB}$ describes the entire rate-region of an information-theoretic setup, $\mathsf{PF}$ specifies only a corner point of a rate region. Other information-theoretic settings related to $\mathsf{IB}$ and $\mathsf{PF}$ include CEO problem [Courtade_CEO] and source coding for the Gray-Wyner network [Cheuk_LI_GrayWyner].

II-C1 Noisy Source Coding

Suppose Alice has access only to a noisy version $X$ of a source of interest $Y$ . She wishes to transmit a rate-constrained description from her observation (i.e., $X$ ) to Bob such that he can recover $Y$ with small average distortion. More precisely, let $(X^{n},Y^{n})$ be $n$ i.i.d. samples of $(X,Y)\sim P_{XY}$ . Alice encodes her observation $X^{n}$ through an encoder $\phi:{\mathcal{X}}^{n}\to\{1,\dots,K_{n}\}$ and sends $\phi(X^{n})$ to Bob. Upon receiving $\phi(X^{n})$ , Bob reconstructs a "soft" estimate of $Y^{n}$ via a decoder $\psi:\{1,\dots,K_{n}\}\to\widehat{\mathcal{Y}}^{n}$ where $\widehat{\mathcal{Y}}={\mathcal{P}}({\mathcal{Y}})$ . That is, the reproduction sequence $\hat{y}^{n}$ consists of $n$ probability measures on ${\mathcal{Y}}$ . For any source and reproduction sequences $y^{n}$ and $\hat{y}^{n}$ , respectively, the distortion is defined as

d(y^{n},\hat{y}^{n})\coloneqq\frac{1}{n}\sum_{i=1}^{n}d(y_{i},\hat{y}_{i}),

where

d(y,\hat{y})\coloneqq\log\frac{1}{\hat{y}(y)}.

(27)

We say that a pair of rate-distortion $(\mathsf{R},\mathsf{D})$ is achievable if there exists a pair $(\phi,\psi)$ of encoder and decoder such that

\limsup_{n\to\infty}{\mathbb{E}}[d(Y^{n},\psi(\phi(X^{n})))]\leq\mathsf{D},\qquad\text{and}\qquad\limsup_{n\to\infty}\frac{1}{n}\log K_{n}\leq\mathsf{R}.

(28)

The noisy rate-distortion function $\mathsf{R}^{\mathsf{noisy}}(\mathsf{D})$ for a given $\mathsf{D}\geq 0$ , is defined as the minimum rate $\mathsf{R}$ such that $(\mathsf{R},\mathsf{D})$ is an achievable rate-distortion pair. This problem arises naturally in many data analytic problems. Some examples include feature selection of a high-dimensional dataset, clustering, and matrix completion. This problem was first studied by Dobrushin and Tsybakov [Noisy_SourceCoding_Dobrushin], who showed that $\mathsf{R}^{\mathsf{noisy}}(\mathsf{D})$ is analogous to the classical rate-distortion function

\displaystyle\mathsf{R}^{\mathsf{noisy}}(\mathsf{D})

\displaystyle=\inf_{\begin{subarray}{c}P_{\hat{Y}|X}:{\mathbb{E}}[d(Y,\hat{Y})]\leq\mathsf{D},\\ Y\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}X\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}\hat{Y}\end{subarray}}I(X;\hat{Y}).

(29)

It can be easily verified that ${\mathbb{E}}[d(Y,\hat{Y})]=H(Y|\hat{Y})$ and hence (after relabeling $\hat{Y}$ as $T$ )

\mathsf{R}^{\mathsf{noisy}}(\mathsf{D})=\inf_{\begin{subarray}{c}P_{T|X}:I(Y;T)\geq R,\\ Y\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}X\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}T\end{subarray}}I(X;T),

(30)

where $R=H(Y)-\mathsf{D}$ , which is equal to $\widetilde{\mathsf{IB}}$ defined in (4). For more details in connection between noisy source coding and $\mathsf{IB}$ , the reader is referred to [Bottleneck_Shamai, Bottleneck_Polyanskiy, Courtade_CEO, Collaborative_IB]. Notice that one can study an essentially identical problem where the distortion constraint (28) is replaced by

\lim_{n\to\infty}\frac{1}{n}I(Y^{n};\psi(\phi(X^{n})))\geq R,\qquad\text{and}\qquad\limsup_{n\to\infty}\frac{1}{n}\log K_{n}\leq\mathsf{R}.

This problem is addressed in [IB_operational] for discrete alphabets ${\mathcal{X}}$ and ${\mathcal{Y}}$ and extended recently in [IB_General] for any general alphabets.

II-C2 Test Against Independence with Communication Constraint

As mentioned earlier, the connection between $\mathsf{IB}$ and noisy source coding, described above, was known and studied in [Bottleneck_Shamai, Bottleneck_Polyanskiy]. Here, we provide a new information-theoretic setting which provides yet another operational meaning for $\mathsf{IB}$ . Given $n$ i.i.d. samples $(X_{1},Y_{1}),\dots,(X_{n},Y_{n})$ from joint distribution $Q$ , we wish to test whether $X_{i}$ are independent of $Y_{i}$ , that is, $Q$ is a product distribution. This task is formulated by the following hypothesis test:

\displaystyle\begin{aligned} H_{0}:&\leavevmode\nobreak\ \leavevmode\nobreak\ Q=P_{XY},\\ H_{1}:&\leavevmode\nobreak\ \leavevmode\nobreak\ Q=P_{X}P_{Y},\end{aligned}

(31)

for a given joint distribution $P_{XY}$ with marginals $P_{X}$ and $P_{Y}$ . Ahlswede and Csiszár [Hypothesis_Testing_Ahslwede] investigated this problem under a communication constraint: While $Y$ observations (i.e., $Y_{1},\dots,Y_{n}$ ) are available, the $X$ observations need to be compressed at rate $R$ , that is, instead of $X^{n}$ , only $\phi(X^{n})$ is present where $\phi:{\mathcal{X}}^{n}\to\{1,\dots,K_{n}\}$ satisfies

\frac{1}{n}\log K_{n}\leq R.

For the type I error probability not exceeding a fixed $\varepsilon\in(0,1)$ , Ahlswede and Csiszár [Hypothesis_Testing_Ahslwede] derived the smallest possible type 2 error probability, defined as

\beta_{R}(n,\varepsilon)=\min_{\phi:{\mathcal{X}}^{n}\to[K]\atop\frac{1}{n}\log K_{n}\leq R}\min_{A\subset[K_{n}]\times{\mathcal{Y}}^{n}}\Big{\{}(P_{\phi(X^{n})}\times P_{Y^{n}})(A):\leavevmode\nobreak\ \leavevmode\nobreak\ P_{\phi(X^{n})\times Y^{n}}(A)\geq 1-\varepsilon\Big{\}}.

The following gives the asymptotic expression of $\beta_{R}(n,\varepsilon)$ for every $\varepsilon\in(0,1)$ . For the proof, refer to [Hypothesis_Testing_Ahslwede, Theorem 3].

Theorem 7 ([Hypothesis_Testing_Ahslwede]).

For every $R\geq 0$ and $\varepsilon\in(0,1)$ , we have

\lim_{n\to\infty}-\frac{1}{n}\log\beta_{R}(n,\varepsilon)=\mathsf{IB}(R).

In light of this theorem, $\mathsf{IB}(R)$ specifies the exponential rate at which the type II error probability of the hypothesis test (31) decays as the number of samples increases.

II-C3 Dependence Dilution

Inspired by the problems of information amplification [Cover_State_Amplification] and state masking [Merhav_state_masking], Asoodeh et al. [Asoode_submitted] proposed the dependence dilution setup as follows. Consider a source sequences $X^{n}$ of $n$ i.i.d. copies of $X\sim P_{X}$ . Alice observes the source $X^{n}$ and wishes to encode it via the encoder

f_{n}:{\mathcal{X}}^{n}\to\{1,2,\dots,2^{nR}\},

for some $R>0$ . The goal is to ensure that any user observing $f_{n}(X^{n})$ can construct a list, of fixed size, of sequences in ${\mathcal{X}}^{n}$ that contains likely candidates of the actual sequence $X^{n}$ while revealing negligible information about a correlated source $Y^{n}$ . To formulate this goal, consider the decoder

g_{n}:\{1,2,\dots,2^{nR}\}\to 2^{{\mathcal{X}}^{n}},

where $2^{{\mathcal{X}}^{n}}$ denotes the power set of ${\mathcal{X}}^{n}$ . A dependence dilution triple $(R,\Gamma,\Delta)\in\mathbb{R}^{3}_{+}$ is said to be achievable if, for any $\delta>0$ , there exists a pair of encoder and decoder $(f_{n},g_{n})$ such that for sufficiently large $n$

\Pr\left(X^{n}\notin g_{n}(J)\right)<\delta,

(32)

having fixed size $|g_{n}(J)|=2^{n(H(X)-\Gamma)},$ where $J=f_{n}(X^{n})$ and simultaneously

\frac{1}{n}I(Y^{n};J)\leq\Delta+\delta.

(33)

Notice that without side information $J$ , the decoder can only construct a list of size $2^{nH(X)}$ which contains $X^{n}$ with probability close to one. However, after $J$ is observed and the list $g_{n}(J)$ is formed, the decoder’s list size can be reduced to $2^{n(H(X)-\Gamma)}$ and thus reducing the uncertainty about $X^{n}$ by $n\Gamma\in[0,nH(X)]$ . This observation can be formalized to show (see [Cover_State_Amplification] for details) that the constraint (32) is equivalent to

\frac{1}{n}I(X^{n};J)\geq\Gamma-\delta,

(34)

which lower bounds the amount of information $J$ carries about $X^{n}$ . Built on this equivalent formulation, Asoodeh et al. [Asoode_submitted, Corollary 15] derived a necessary condition for the achievable dependence dilution triple.

Theorem 8 ([Asoode_submitted]).

Any achievable dependence dilution triple $(R,\Gamma,\Delta)$ satisfies

\begin{cases}R&\geq\Gamma\\ \Gamma&\leq I(X;T)\\ \Delta&\geq I(Y;T)-I(X;T)+\Gamma,\end{cases}

for some auxiliary random variable $T$ satisfying $Y\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}X\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}T$ and taking $|{\mathcal{T}}|\leq|{\mathcal{X}}|+1$ values.

According to this theorem, $\mathsf{PF}(\Gamma)$ specifies the best privacy performance of the dependence dilution setup for the maximum amplification rate $\Gamma$ . While this informs the operational interpretation of $\mathsf{PF}$ , Theorem 8 only provides an outer bound for the set of achievable dependence dilution triple $(R,\Gamma,\Delta)$ . It is, however, not clear that $\mathsf{PF}$ characterizes the rate region of an information-theoretic setup.

The fact that $\mathsf{IB}$ fully characterizes the rate-region of an source coding setup has an important consequence: the cardinality of the auxiliary random variable $T$ in $\mathsf{IB}$ can be improved to $|{\mathcal{X}}|$ instead of $|{\mathcal{X}}|+1$ .

II-D Cardinality Bound

Recall that in the definition of $\mathsf{IB}$ in (4), no assumption was imposed on the auxiliary random variable $T$ . A straightforward application of Carathéodory-Fenchel-Eggleston theorem²²2This is a strengthening of the original Carathéodory theorem when the underlying space is connected, see e.g., [Witsenhausen_Convexity, Section III] or [csiszarbook, Lemma 15.4]. reveals that $\mathsf{IB}$ is attained for $T$ taking values in a set ${\mathcal{T}}$ with cardinality $|{\mathcal{T}}|\leq|{\mathcal{X}}|+1$ . Here, we improve this bound and show that cardinality bound to $|{\mathcal{T}}|\leq|{\mathcal{X}}|$ .

Theorem 9.

For any joint distribution $P_{XY}$ and $R\in(0,H(X)]$ , information bottleneck $\mathsf{IB}(R)$ is achieved by $T$ taking at most $|{\mathcal{X}}|$ values.

The proof of this theorem hinges on the operational characterization of $\mathsf{IB}$ as the lower boundary of the rate-distortion region of noisy source coding problem discussed in Section II-C. Specifically, we first show that the extreme points of this region is achieved by $T$ taking $|{\mathcal{X}}|$ values. We then make use of a property of the noisy source coding problem (namely, time-sharing) to argue that all points of this region (including the boundary points) can be attained by such $T$ . It must be mentioned that this result was already claimed by Harremoës and Tishby in [harremoes2007information] without proof.

In many practical scenarios, feature $X$ has a large alphabet. Hence, the bound $|{\mathcal{T}}|\leq|{\mathcal{X}}|$ , albeit optimal, still can make the information bottleneck function computationally intractable over large alphabets. However, label $Y$ usually has a significantly smaller alphabet. While it is in general impossible to have a cardinality bound for $T$ in terms of $|{\mathcal{Y}}|$ , one can consider approximating $\mathsf{IB}$ assuming $T$ takes $N$ values. The following result, recently proved by Hirche and Winter [Hirche_IB_cardinality], is in this spirit.

Theorem 10 ([Hirche_IB_cardinality]).

For any $(X,Y)\sim P_{XY}$ , we have

\mathsf{IB}(R,N)\leq\mathsf{IB}(R)\leq\mathsf{IB}(R,N)+\delta(N),

where $\delta(N)=4N^{-\frac{1}{|{\mathcal{Y}}|}}\left[\log\frac{|{\mathcal{Y}}|}{4}+\frac{1}{|{\mathcal{Y}}|}\log N\right]$ and $\mathsf{IB}(R,N)$ denotes the information bottleneck functional (4) with the additional constraint that $|{\mathcal{T}}|\leq N$ .

Recall that, unlike $\mathsf{PF}$ , the graph of $\mathsf{IB}$ characterizes the rate region of a Shannon-theoretic coding problem (as illustrated in Section II-C), and hence any boundary points can be constructed via time-sharing of extreme points of the rate region. This lack of operational characterization of $\mathsf{PF}$ translates into a worse cardinality bound than that of $\mathsf{IB}$ . In fact, for $\mathsf{PF}$ the cardinality bound $|{\mathcal{T}}|\leq|{\mathcal{X}}|+1$ cannot be improved in general. To demonstrate this, we numerically solve the optimization in $\mathsf{PF}$ assuming that $|{\mathcal{T}}|=|{\mathcal{X}}|$ when both $X$ and $Y$ are binary. As illustrated in Fig. 5, this optimization does not lead to a convex function, and hence, cannot be equal to $\mathsf{PF}$ .

II-E Deterministic Information Bottleneck

As mentioned earlier, $\mathsf{IB}$ formalizes an information-theoretic approach to clustering high-dimensional feature $X$ into cluster labels $T$ that preserve as much information about the label $Y$ as possible. The clustering label is assigned by the soft operator $P_{T|X}$ that solves the $\mathsf{IB}$ formulation (4) according to the rule: $X=x$ is likely assigned label $T=t$ if $D_{\mathsf{KL}}(P_{Y|x}\|P_{Y|t})$ is small where $P_{Y|t}=\sum_{x}P_{Y|x}P_{X|t}$ . That is, clustering is assigned based on the similarity of conditional distributions. As in many practical scenarios, a hard clustering operator is preferred, Strouse and Schwab [strouse2017dib] suggested the following variant of $\mathsf{IB}$ , termed as deterministic information bottleneck $\mathsf{dIB}$

\mathsf{dIB}(P_{XY},R)\coloneqq\sup_{\begin{subarray}{c}f:{\mathcal{X}}\to{\mathcal{T}},\\ H(f(X))\leq R\end{subarray}}I(Y;f(X)),

(35)

where the maximization is taken over all deterministic functions $f$ whose range is a finite set ${\mathcal{T}}$ . Similarly, one can define

\mathsf{dPF}(P_{XY},r)\coloneqq\inf_{\begin{subarray}{c}f:{\mathcal{X}}\to{\mathcal{T}},\\ H(f(X))\geq r\end{subarray}}I(Y;f(X)).

(36)

One way to ensure that $H(f(X))\leq R$ for a deterministic function $f$ is to restrict the cardinality of the range of $f$ : if $f:{\mathcal{X}}\to[e^{R}]$ then $H(f(X))$ is necessarily smaller than $R$ . Using this insight, we derive a lower for $\mathsf{dIB}(P_{XY},R)$ in the following lemma.

Lemma 5.

For any given $P_{XY}$ , we have

\mathsf{dIB}(P_{XY},R)\geq\frac{e^{R}-1}{|{\mathcal{X}}|}I(X;Y),

and

\mathsf{dPF}(P_{XY},r)\leq\frac{e^{r}-1}{|{\mathcal{X}}|}I(X;Y)+\Pr(X\geq e^{r})\log\frac{1}{\Pr(X\geq e^{r})}.

Note that both $R$ and $r$ are smaller than $H(X)$ and thus the multiplicative factors of $I(X;Y)$ in the lemma are smaller than one. In light of this lemma, we can obtain

\frac{e^{R}-1}{|{\mathcal{X}}|}I(X;Y)\leq\mathsf{IB}(R)\leq I(X;Y),

and

\mathsf{PF}(r)\leq\frac{e^{r}-1}{|{\mathcal{X}}|}I(X;Y)+\Pr(X\geq e^{r})\log\frac{1}{\Pr(X\geq e^{r})}.

In most of practical setups, $|{\mathcal{X}}|$ might be very large, making the above lower bound for $\mathsf{IB}$ vacuous. In the following lemma, we partially address this issue by deriving a bound independent of ${\mathcal{X}}$ when $Y$ is binary.

Lemma 6.

Let $P_{XY}$ be a joint distribution of arbitrary $X$ and binary $Y\sim{\mathsf{Bernoulli}}(q)$ for some $q\in(0,1)$ . Then, for any $R\geq\log 5$ we have

\mathsf{dIB}(P_{XY},R)\geq I(X;Y)-2\alpha h_{\mathsf{b}}\Big{(}\frac{I(X;Y)}{2\alpha(e^{R}-4)}\Big{)},

where $\alpha=\max\{\log\frac{1}{q},\log\frac{1}{1-q}\}$ .

III Family of Bottleneck Problems

In this section, we introduce a family of bottleneck problems by extending $\mathsf{IB}$ and $\mathsf{PF}$ to a large family of statistical measures. Similar to $\mathsf{IB}$ and $\mathsf{PF}$ , these bottleneck problems are defined in terms of boundaries of a two-dimensional convex set induced by a joint distribution $P_{XY}$ . Recall that $R\mapsto\mathsf{IB}(P_{XY},R)$ and $r\mapsto\mathsf{PF}(P_{XY},r)$ are the upper and lower boundary of the set ${\mathcal{M}}$ defined in (6) and expressed here again for convenience

{\mathcal{M}}=\big{\{}(I(X;T),I(Y;T)):Y\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}X\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}T,(X,Y)\sim P_{XY}\big{\}}.

(37)

Since $P_{XY}$ is given, $H(X)$ and $H(Y)$ are fixed. Thus, in characterizing ${\mathcal{M}}$ it is sufficient to consider only $H(X|T)$ and $H(Y|T)$ . To generalize $\mathsf{IB}$ and $\mathsf{PF}$ , we must therefore generalize $H(X|T)$ and $H(Y|T)$ .

Given a joint distribution $P_{XY}$ and two non-negative real-valued functions $\Phi:{\mathcal{P}}({\mathcal{X}})\to\mathbb{R}^{+}$ and $\Psi:{\mathcal{P}}({\mathcal{Y}})\to\mathbb{R}^{+}$ , we define

\Phi(X|T)\coloneqq{\mathbb{E}}\left[\Phi(P_{X|T})\right]=\sum_{t\in{\mathcal{T}}}P_{T}(t)\Phi(P_{X|T=t}),

(38)

and

\Psi(Y|T)\coloneqq{\mathbb{E}}\left[\Psi(P_{Y|T})\right]=\sum_{t\in{\mathcal{T}}}P_{T}(t)\Psi(P_{Y|T=t}).

(39)

When $X\sim P_{X}$ and $Y\sim P_{Y}$ , we interchangeably write $\Phi(X)$ for $\Phi(P_{X})$ and $\Phi(Y)$ for $\Psi(P_{Y})$ .

These definitions provide natural generalizations for Shannon’s entropy and mutual information. Moreover, as we discuss later in Sections LABEL:Sec:Geussing and LABEL:Sec:Arimoto, it also can be specialized to represent a large family of popular information-theoretic and statistical measures. Examples include information and estimation theoretic quantities such as Arimoto’s conditional entropy of order $\alpha$ for $\Phi(Q_{X})=||Q_{X}||_{\alpha}$ , probability of correctly guessing for $\Phi(Q_{X})=||Q_{X}||_{\infty}$ , maximal correlation for binary case, and $f$ -information for $\Phi(Q_{X})$ given by $f$ -divergence. We are able to generate a family of bottleneck problems using different instantiations of $\Phi(X|T)$ and $\Psi(Y|T)$ in place of mutual information in $\mathsf{IB}$ and $\mathsf{PF}$ . As we argue later, these problems better capture the essence of "informativeness" and "privacy"; thus providing analytical and interpretable guarantees similar in spirit to $\mathsf{IB}$ and $\mathsf{PF}$ .

Computing these bottleneck problems in general boils down to the following optimization problems

{\mathsf{U}}_{\Phi,\Psi}(\zeta)\coloneqq\sup_{\begin{subarray}{c}P_{T|X}:Y\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}X\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}T\\ \Phi(X|T)\leq\zeta\end{subarray}}\Psi(Y|T),

(40)

and

Bottleneck Problems: Information and Estimation-Theoretic View††thanks: This work was supported in part by NSF under grants CIF 1922971, 1815361, 1742836, 1900750, and CIF CAREER 1845852.

Abstract

I Introduction

I-A Related Work

I-B Notation

II Information Bottleneck and Privacy Funnel: Definitions and Functional Properties

Theorem 1.

Theorem 2 (Perfect privacy).

Theorem 3 (Linear lower bound).

Lemma 1.

Theorem 4.

Theorem 5 (Additivity).

II-A Gaussian 𝖨𝖡\mathsf{IB} and 𝖯𝖥\mathsf{PF}

Lemma 2.

Corollary 1.

Lemma 3.

Theorem 6.

II-B Evaluation of 𝖨𝖡\mathsf{IB} and 𝖯𝖥\mathsf{PF}

Proposition 1.

Remark 1.

Remark 2.

Lemma 4 (Mr. and Mrs. Gerber’s Lemma).

II-C Operational Meaning of 𝖨𝖡\mathsf{IB} and 𝖯𝖥\mathsf{PF}

II-C1 Noisy Source Coding

II-C2 Test Against Independence with Communication Constraint

Theorem 7 ([Hypothesis_Testing_Ahslwede]).

II-C3 Dependence Dilution

Theorem 8 ([Asoode_submitted]).

II-D Cardinality Bound

Theorem 9.

Theorem 10 ([Hirche_IB_cardinality]).

II-E Deterministic Information Bottleneck

Lemma 5.

Lemma 6.

III Family of Bottleneck Problems

Bottleneck Problems:
Information and Estimation-Theoretic View^†^†thanks: This work was supported in part by NSF under grants CIF 1922971, 1815361, 1742836, 1900750, and CIF CAREER 1845852.

II-A Gaussian $\mathsf{IB}$ and $\mathsf{PF}$

II-B Evaluation of $\mathsf{IB}$ and $\mathsf{PF}$

II-C Operational Meaning of $\mathsf{IB}$ and $\mathsf{PF}$