
Bottleneck Problems: Information and Estimation-Theoretic View
(This work was supported in part by NSF under grants CIF 1922971, 1815361, 1742836, 1900750, and CIF CAREER 1845852.)

Shahab Asoodeh and Flavio P. Calmon
S. Asoodeh and F. P. Calmon are with the School of Engineering and Applied Sciences, Harvard University (e-mails: {shahab, flavio}@seas.harvard.edu). Part of the results in this paper were presented at the International Symposium on Information Theory 2018 [Hsu_Generalizing].
Abstract

Information bottleneck (IB) and privacy funnel (PF) are two closely related optimization problems that have found applications in machine learning, the design of privacy algorithms, capacity problems (e.g., Mrs. Gerber's Lemma), and strong data processing inequalities, among others. In this work, we first investigate the functional properties of IB and PF through a unified theoretical framework. We then connect them to three information-theoretic coding problems, namely hypothesis testing against independence, noisy source coding, and dependence dilution. Leveraging these connections, we prove a new cardinality bound for the auxiliary variable in IB, making its computation more tractable for discrete random variables.

In the second part, we introduce a general family of optimization problems, termed bottleneck problems, by replacing mutual information in IB and PF with other notions of mutual information, namely $f$-information and Arimoto's mutual information. We then argue that, unlike IB and PF, these problems lead to easily interpretable guarantees in a variety of inference tasks with statistical constraints on accuracy and privacy. Although the underlying optimization problems are non-convex, we develop a technique to evaluate bottleneck problems in closed form by equivalently expressing them in terms of lower convex or upper concave envelopes of certain functions. By applying this technique to the binary case, we derive closed-form expressions for several bottleneck problems.

I Introduction

Optimization formulations that involve information-theoretic quantities (e.g., mutual information) have been instrumental in a variety of learning problems found in machine learning. A notable example is the information bottleneck (IB) method [tishby2000information]. Suppose $Y$ is a target variable and $X$ is an observable variable correlated with it, with joint distribution $P_{XY}$. The goal of IB is to learn a "compact" summary (aka bottleneck) $T$ of $X$ that is maximally "informative" for inferring $Y$. The bottleneck variable $T$ is assumed to be generated from $X$ by applying a random function $F$ to $X$, i.e., $T=F(X)$, in such a way that it is conditionally independent of $Y$ given $X$, which we denote by

$$Y - X - T. \qquad (1)$$

The IB quantifies this goal by measuring the "compactness" of $T$ using the mutual information $I(X;T)$ and, similarly, "informativeness" by $I(Y;T)$. For a given level of compactness $R\geq 0$, IB extracts the bottleneck variable $T$ that solves the constrained optimization problem

$$\mathsf{IB}(R)\coloneqq\sup\ I(Y;T)\qquad\text{subject to}\qquad I(X;T)\leq R, \qquad (2)$$

where the supremum is taken over all randomized functions $T=F(X)$ satisfying $Y - X - T$.
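As a minimal illustration of the two quantities appearing in (2) (a sketch of ours, assuming finite alphabets; the function names are not from the paper), the following snippet computes $I(X;T)$ and $I(Y;T)$ for a given joint pmf $P_{XY}$ and a candidate channel $P_{T|X}$, using the Markov chain $Y - X - T$ to form the joint distribution of $(Y,T)$.

```python
# A minimal sketch of the quantities in (2): given a joint pmf P_XY and a candidate
# stochastic map P_{T|X}, compute I(X;T) and I(Y;T) under the chain Y - X - T.
import numpy as np

def mutual_information(P):
    """I(A;B) in nats for a joint pmf P of shape (|A|, |B|)."""
    Pa = P.sum(axis=1, keepdims=True)
    Pb = P.sum(axis=0, keepdims=True)
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Pa @ Pb)[mask])))

def ib_objectives(P_XY, P_T_given_X):
    """Return (I(X;T), I(Y;T)) for T generated from X via P_{T|X}."""
    P_X = P_XY.sum(axis=1)                 # marginal of X
    P_XT = P_X[:, None] * P_T_given_X      # joint of (X,T)
    P_YT = P_XY.T @ P_T_given_X            # joint of (Y,T), using Y - X - T
    return mutual_information(P_XT), mutual_information(P_YT)

# toy example: binary X,Y through a BSC(0.1), T a noisy two-level summary of X
P_XY = np.array([[0.45, 0.05],
                 [0.05, 0.45]])
P_T_given_X = np.array([[0.9, 0.1],
                        [0.2, 0.8]])
print(ib_objectives(P_XY, P_T_given_X))
```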

The optimization problem that underlies the information bottleneck has been studied in the information theory literature as early as the 1970's (see [Gerber, Witsenhausen_Wyner, Gerbator_Ahlswede1977, Wyner_Gerber]) as a technique to prove impossibility results in information theory and also to study the common information between $X$ and $Y$. Wyner and Ziv [Gerber] explicitly determined the value of $\mathsf{IB}(R)$ for the special case of binary $X$ and $Y$, a result widely known as Mrs. Gerber's Lemma [Gerber, networkinfotheory]. More than twenty years later, the information bottleneck function was studied by Tishby et al. [tishby2000information] and re-formulated in a data analytic context. Here, the random variable $X$ represents a high-dimensional observation with a corresponding low-dimensional feature $Y$. IB aims at specifying a compressed description of $X$ that is maximally informative about the feature $Y$. This framework led to several applications in clustering [slonim2000document, IB_clustering, Agglomerative_IB] and quantization [Quantization_IB, Quantization2_IB].

A closely related framework to IB is the privacy funnel (PF) problem [Makhdoumi2014FromTI, Calmon_fundamental-Limit, Asoodeh_Allerton]. In the PF framework, a bottleneck variable $T$ is sought to maximally preserve the "information" contained in $X$ while revealing as little about $Y$ as possible. This framework aims to capture the inherent trade-off between revealing $X$ perfectly and leaking a sensitive attribute $Y$. For instance, suppose a user wishes to share an image $X$ for some classification tasks. The image might carry information about attributes, say $Y$, that the user considers sensitive, even when such information is of limited use for the tasks, e.g., location or emotion. The PF framework seeks to extract a representation of $X$ from which the original image can be recovered with maximal accuracy while minimizing the privacy leakage with respect to $Y$. Using mutual information for both privacy leakage and informativeness, the privacy funnel can be formulated as

$$\mathsf{PF}(r)\coloneqq\inf\ I(Y;T)\qquad\text{subject to}\qquad I(X;T)\geq r, \qquad (3)$$

where the infimum is taken over all randomized functions $T=F(X)$ and $r$ is the parameter specifying the level of informativeness. It is evident from the formulations (2) and (3) that IB and PF are closely related. In fact, we shall see later that they correspond to the upper and lower boundaries of a two-dimensional compact convex set. This duality has led to the design of greedy algorithms [Makhdoumi2014FromTI, Sadeghi_PF] for estimating PF based on the agglomerative information bottleneck algorithm [Agglomerative_IB]. A similar formulation has recently been proposed in [PF_Adverserially] as a tool to train a neural network for learning a private representation of the data $X$. Solving the IB and PF optimization problems analytically is challenging. However, recent machine learning applications, and deep learning algorithms in particular, have reignited the study of both IB and PF (see Related Work).

In this paper, we first give a cohesive overview of the existing results surrounding the IB and PF formulations. We then provide a comprehensive analysis of IB and PF from an information-theoretic perspective, as well as a survey of several formulations connected to IB and PF that have been introduced in the information theory and machine learning literature. Moreover, we overview connections with coding problems such as remote source coding [Noisy_SourceCoding_Dobrushin], testing against independence [Hypothesis_Testing_Ahslwede], and dependence dilution [Asoode_submitted]. Leveraging these connections, we prove a new cardinality bound for the bottleneck variable in IB, leading to a more tractable optimization problem for IB. We then consider a broad family of optimization problems by going beyond mutual information in formulations (2) and (3). We propose two candidates for this task: Arimoto's mutual information [Arimoto_Original_Paper] and $f$-information [Maxim_Strong_TIT]. By replacing $I(Y;T)$ and/or $I(X;T)$ with either of these measures, we generate a family of optimization problems that we refer to as bottleneck problems. These problems are shown to better capture the underlying trade-offs intended by IB and PF. More specifically, our main contributions are listed next.

  • Computing IB and PF is notoriously challenging when $X$ takes values in a set with infinite cardinality (e.g., $X$ is drawn from a continuous probability distribution). We consider three different scenarios to circumvent this difficulty. First, we assume that $X$ is a Gaussian perturbation of $Y$, i.e., $X=Y+N^{\mathsf{G}}$ where $N^{\mathsf{G}}$ is a noise variable sampled from a Gaussian distribution independent of $Y$. Building upon recent advances in the entropy power inequality in [EPI_Courtade], we derive a sharp upper bound for $\mathsf{IB}(R)$. As a special case, we consider jointly Gaussian $(X,Y)$, for which the upper bound becomes tight. This provides a significantly simpler proof than the original one given in [GaussianIB] of the fact that, in this special case, the optimal bottleneck variable $T$ is also Gaussian. In the second scenario, we assume that $Y$ is a Gaussian perturbation of $X$, i.e., $Y=X+N^{\mathsf{G}}$. This corresponds to a practical setup where the feature $Y$ can be perfectly obtained from a noisy observation of $X$. Relying on recent results on strong data processing inequalities [calmon2015strong], we obtain an upper bound on $\mathsf{IB}(R)$ which is tight for small values of $R$. In the last scenario, we compute a second-order approximation of $\mathsf{PF}(r)$ under the assumption that $T$ is obtained by a Gaussian perturbation of $X$, i.e., $T=X+N^{\mathsf{G}}$. Interestingly, the rate of increase of $\mathsf{PF}(r)$ for small values of $r$ is shown to be dictated by an asymmetric measure of dependence introduced by Rényi [Renyi-dependence-measure].

  • We extend Witsenhausen and Wyner's approach [Witsenhausen_Wyner] for analytically computing IB and PF. This technique converts the optimization problems in IB and PF into determining the convex and concave envelopes of a certain function, respectively. We apply this technique to binary $X$ and $Y$ and derive a closed-form expression for $\mathsf{PF}(r)$; we call this result Mr. Gerber's Lemma.

  • Relying on the connection between IB and noisy source coding [Noisy_SourceCoding_Dobrushin] (see [Bottleneck_Polyanskiy, Bottleneck_Shamai]), we show that the optimal bottleneck variable $T$ in the optimization problem (2) takes values in a set ${\mathcal{T}}$ with cardinality $|{\mathcal{T}}|\leq|{\mathcal{X}}|$. Compared to the best cardinality bound previously known (i.e., $|{\mathcal{T}}|\leq|{\mathcal{X}}|+1$), this result leads to a reduction in the dimension of the search space of the optimization problem (2) from $\mathbb{R}^{|{\mathcal{X}}|^{2}}$ to $\mathbb{R}^{|{\mathcal{X}}|(|{\mathcal{X}}|-1)}$. Moreover, we show that this does not hold for PF, indicating a fundamental difference between the optimization problems (2) and (3).

  • Following [strouse2017dib, Asoodeh_Allerton], we study the deterministic IB and PF (denoted by $\mathsf{dIB}$ and $\mathsf{dPF}$) in which $T$ is assumed to be a deterministic function of $X$, i.e., $T=f(X)$ for some function $f$. By connecting $\mathsf{dIB}$ and $\mathsf{dPF}$ with entropy-constrained scalar quantization problems in information theory [Polyanskiy_Distilling], we obtain bounds on them explicitly in terms of $|{\mathcal{X}}|$. Applying these bounds to IB, we obtain that $\frac{\mathsf{IB}(R)}{I(X;Y)}$ is bounded from above by one and from below by $\min\{\frac{R}{H(X)},\frac{e^{R}-1}{|{\mathcal{X}}|}\}$.

  • By replacing $I(Y;T)$ and/or $I(X;T)$ in (2) and (3) with Arimoto's mutual information or $f$-information, we generate a family of bottleneck problems. We then argue that these new functionals better describe the trade-offs that were intended to be captured by IB and PF. The main reason is three-fold: First, as illustrated in Section II-C, the use of mutual information in IB and PF is mainly justified when $n\gg 1$ independent samples $(X_{1},Y_{1}),\dots,(X_{n},Y_{n})$ of $P_{XY}$ are considered. However, Arimoto's mutual information allows for an operational interpretation even in the single-shot regime (i.e., for $n=1$). Second, $I(Y;T)$ in IB and PF is meant to be a proxy for the efficiency of reconstructing $Y$ given the observation $T$. However, this can be accurately formalized by the probability of correctly guessing $Y$ given $T$ (i.e., Bayes risk) or the minimum mean-squared error (MMSE) in estimating $Y$ given $T$. While $I(Y;T)$ bounds these two measures, we show that they are precisely characterized by Arimoto's mutual information and $f$-information, respectively. Finally, when $P_{XY}$ is unknown, mutual information is notoriously difficult to estimate. Nevertheless, Arimoto's mutual information and $f$-information are easier to estimate: while mutual information can be estimated with an estimation error that scales as $O(\log n/\sqrt{n})$ [Shamir_IB], Diaz et al. [Diaz_Robustness] showed that this estimation error for Arimoto's mutual information and $f$-information is $O(1/\sqrt{n})$.

    We also generalize our computation technique, enabling us to analytically compute these bottleneck problems. As before, this technique converts computing bottleneck problems into determining convex and concave envelopes of certain functions. Focusing on binary $X$ and $Y$, we derive closed-form expressions for some of the bottleneck problems.

I-A Related Work

The IB formulation has been extensively applied in representation learning and clustering [IB_clustering, IB_DocumentClustering, IB_DoubleClustering, IB_Hidden, Zaidi_distributedIB, Zaidi2019distributed]. Clustering based on IB results in algorithms that cluster data points in terms of the similarity of $P_{Y|X}$. When data points lie in a metric space, geometric clustering is usually preferred, where clustering is based upon the geometric (e.g., Euclidean) distance. Strouse and Schwab [strouse2017dib, strouse2019clustering] proposed the deterministic IB (denoted by $\mathsf{dIB}$) by enforcing that $P_{T|X}$ is a deterministic mapping: $\mathsf{dIB}(R)$ denotes the supremum of $I(Y;f(X))$ over all functions $f:{\mathcal{X}}\to{\mathcal{T}}$ satisfying $H(f(X))\leq R$. This optimization problem is closely related to the problem of scalar quantization in information theory: designing a function $f:{\mathcal{X}}\to[M]\coloneqq\{1,\dots,M\}$ with a pre-determined output alphabet such that $f$ optimizes some objective function. This objective might be maximizing or minimizing $H(f(X))$ [Cicalese] or maximizing $I(Y;f(X))$ for a random variable $Y$ correlated with $X$ [Polyanskiy_Distilling, Lapidoth_Koch, LDPC1_quantization, LDPC2_quantization]. Since $H(f(X))\leq\log M$ for $f:{\mathcal{X}}\to[M]$, the latter problem provides lower bounds for $\mathsf{dIB}$ (and thus for IB). In particular, one can exploit [LDPC3_quantization, Theorem 1] to obtain $I(X;Y)-\mathsf{dIB}(R)\leq O(e^{-2R/|{\mathcal{Y}}|-1})$ provided that $\min\{|{\mathcal{X}}|,2^{R}\}>2|{\mathcal{Y}}|$. This result establishes a linear gap between $\mathsf{dIB}$ and $I(X;Y)$ irrespective of $|{\mathcal{X}}|$.

The connection between quantization and $\mathsf{dIB}$ further allows us to obtain multiplicative bounds. For instance, if $Y\sim{\mathsf{Bernoulli}}(\frac{1}{2})$ and $X=Y+N^{\mathsf{G}}$, where $N^{\mathsf{G}}\sim{\mathcal{N}}(0,1)$ is independent of $Y$, then it is well known in the information theory literature that $I(Y;f(X))\geq\frac{2}{\pi}I(X;Y)$ for all non-constant $f:{\mathcal{X}}\to\{0,1\}$ (see, e.g., [Viterbi, Section 2.11]); thus $\mathsf{dIB}(R)\geq\frac{2}{\pi}I(X;Y)$ for $R\leq 1$. We further explore this connection to provide multiplicative bounds on $\mathsf{dIB}(R)$ in Section II-E.
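As a quick numerical sanity check of the multiplicative bound quoted above (our sketch, not from the paper; it examines only the particular one-bit quantizer $f(x)=\mathbf{1}\{x>1/2\}$, with all quantities in nats):

```python
# Spot-check of the 2/pi bound for Y ~ Bernoulli(1/2), X = Y + N(0,1), f(x) = 1{x > 1/2}.
import numpy as np
from scipy import integrate
from scipy.stats import norm

def h_b(p):                               # binary entropy in nats
    return -p*np.log(p) - (1-p)*np.log(1-p)

f_X = lambda x: 0.5*norm.pdf(x, 0, 1) + 0.5*norm.pdf(x, 1, 1)   # density of X (mixture)
h_X, _ = integrate.quad(lambda x: -f_X(x)*np.log(f_X(x)), -10, 11)
I_XY = h_X - 0.5*np.log(2*np.pi*np.e)     # I(X;Y) = h(X) - h(X|Y), h(X|Y) = Gaussian entropy

eps = norm.sf(0.5)                        # crossover probability of the quantized channel
I_Yf = np.log(2) - h_b(eps)               # I(Y; 1{X > 1/2})

print(I_Yf / I_XY, 2/np.pi)               # the ratio should exceed 2/pi ~ 0.6366
```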

The study of IB has recently gained increasing traction in the context of deep learning. By taking $T$ to be the activity of the hidden layer(s), Tishby and Zaslavsky [tishby2015deep] (see also [IB_DP_openBox]) argued that neural network classifiers trained with cross-entropy loss and stochastic gradient descent (SGD) inherently aim at solving the IB optimization problem. In fact, it is claimed that the graph of the function $R\mapsto\mathsf{IB}(R)$ (the so-called information plane) characterizes the learning dynamics of different layers in the network: shallow layers correspond to maximizing $I(Y;T)$ while deep layers' objective is minimizing $I(X;T)$. While the generality of this claim was refuted empirically in [On_IB_DL] and theoretically in [Inf_flow_IB_Polyiansky, Amjad_IB], it inspired significant follow-up studies. These include (i) modifying neural network training in order to solve the IB optimization problem [alemi2016deep, kolchinsky2017nonlinear, kolchinsky2018caveats, ReleventSparseCode, wickstrom2020information]; (ii) creating connections between IB and generalization error [Piantanida_roleIB], robustness [alemi2016deep], and detection of out-of-distribution data [Alemi_Uncertainity]; and (iii) using IB to understand specific characteristics of neural networks [Yu2018UnderstandingCN, Cheng2018EvaluatingCO, wickstrom2020information, Higgins2017betaVAELB].

In both IB and PF, mutual information poses some limitations. For instance, it may become infinite in deterministic neural networks [On_IB_DL, Inf_flow_IB_Polyiansky, Amjad_IB] and also may not lead to proper privacy guarantees [Issa_Leakage_TIT]. As suggested in [wickstrom2020information, Sufficient_Statistics], one way to address this issue is to replace mutual information with other statistical measures. In the privacy literature, several measures with strong privacy guarantees have been proposed, including Rényi maximal correlation [Asoodeh_CWIT, Asoode_submitted, Fawaz_Makhdoumi], probability of correctly recovering [Asoodeh_TIT19, Asoode_ISIT17], minimum mean-squared estimation error (MMSE) [Asoode_MMSE_submitted, Calmon_principal_TIT], $\chi^{2}$-information [Hao_Privacy_estimation] (a special case of $f$-information, to be described in Section III), Arimoto's and Sibson's mutual information [Shahab_PhD_thesis, Issa_Leakage_TIT] (to be discussed in Section III), maximal leakage [Liao_maximal_leakage], and local differential privacy [privacyaware]. All these measures ensure interpretable privacy guarantees. For instance, it is shown in [Asoode_MMSE_submitted, Calmon_principal_TIT] that if the $\chi^{2}$-information between $Y$ and $T$ is sufficiently small, then no function of $Y$ can be efficiently reconstructed given $T$, thus providing an interpretable privacy guarantee. (The original results in [Asoode_MMSE_submitted, Calmon_principal_TIT] involve Rényi maximal correlation instead of $\chi^{2}$-information. However, it can be shown that $\chi^{2}$-information equals the sum of the squares of the singular values of $f(Y)\mapsto{\mathbb{E}}[f(Y)|T]$ minus one (the largest one), while Rényi maximal correlation equals the second largest singular value [Witsenhausen:dependent]. Thus, $\chi^{2}$-information upper bounds Rényi maximal correlation.)

Another limitation of mutual information is related to its estimation difficulty. It is known that mutual information can be estimated from $n$ samples with an estimation error that scales as $O(\log n/\sqrt{n})$ [Shamir_IB]. However, as shown by Diaz et al. [Diaz_Robustness], the estimation error for most of the above measures scales as $O(1/\sqrt{n})$. Furthermore, the recently popular variational estimators for mutual information, typically implemented via deep learning methods [MI_Estimator_Poole, MINE_Belghazi, Contrastive], present some fundamental limitations [Understanding_Variational]: the variance of the estimator might grow exponentially with the ground-truth mutual information, and the estimator might not satisfy basic properties of mutual information such as the data processing inequality or additivity. McAllester and Stratos [Stratos_MI_Estimator] showed that some of these limitations are inherent to a large family of mutual information estimators.

I-B Notation

We use capital letters, e.g., $X$, for random variables and calligraphic letters for their alphabets, e.g., ${\mathcal{X}}$. If $X$ is distributed according to a probability mass function (pmf) $P_{X}$, we write $X\sim P_{X}$. Given two random variables $X$ and $Y$, we write $P_{XY}$ for their joint distribution and $P_{Y|X}$ for the conditional distribution of $Y$ given $X$. We also interchangeably refer to $P_{Y|X}$ as a channel from $X$ to $Y$. We use $H(X)$ to denote both the entropy and the differential entropy of $X$, i.e., we have

H(X)=x𝒳PX(x)logPX(x)H(X)=-\sum_{x\in{\mathcal{X}}}P_{X}(x)\log P_{X}(x)

if XX is a discrete random variable taking values in 𝒳{\mathcal{X}} with probability mass function (pmf) PXP_{X} and

$$H(X)=-\int f_{X}(x)\log f_{X}(x)\,\text{d}x,$$

if $X$ is an absolutely continuous random variable with probability density function (pdf) $f_{X}$. If $X$ is a binary random variable with $P_{X}(1)=p$, we write $X\sim{\mathsf{Bernoulli}}(p)$. In this case, its entropy is called the binary entropy function and denoted by $h_{\mathsf{b}}(p)\coloneqq-p\log p-(1-p)\log(1-p)$. We use the superscript ${\mathsf{G}}$ to describe a standard Gaussian random variable, i.e., $N^{\mathsf{G}}\sim{\mathcal{N}}(0,1)$. Given two random variables $X$ and $Y$, their (Shannon) mutual information is denoted by $I(X;Y)\coloneqq H(Y)-H(Y|X)$. We let ${\mathcal{P}}({\mathcal{X}})$ denote the set of all probability distributions on the set ${\mathcal{X}}$. Given an arbitrary $Q_{X}\in{\mathcal{P}}({\mathcal{X}})$ and a channel $P_{Y|X}$, we let $Q_{X}P_{Y|X}$ denote the resulting output distribution on ${\mathcal{Y}}$. For any $a\in[0,1]$, we use $\bar{a}$ to denote $1-a$, and for any integer $k\in\mathbb{N}$, $[k]\coloneqq\{1,2,\dots,k\}$.

Throughout the paper, we assume a pair of (discrete or continuous) random variables (X,Y)PXY(X,Y)\sim P_{XY} are given with a fixed joint distribution PXYP_{XY}, marginals PXP_{X} and PYP_{Y}, and conditional distribution PY|XP_{Y|X}. We then use QX𝒫(𝒳)Q_{X}\in{\mathcal{P}}({\mathcal{X}}) to denote an arbitrary distribution with QY=QXPY|X𝒫(𝒴)Q_{Y}=Q_{X}P_{Y|X}\in{\mathcal{P}}({\mathcal{Y}}).

II Information Bottleneck and Privacy Funnel: Definitions and Functional Properties

In this section, we review the information bottleneck and its closely related functional, the privacy funnel. We then prove some analytical properties of these two functionals and develop a convex-analytic approach which enables us to compute closed-form expressions for both functionals in some simple cases.

To precisely quantify the trade-off between the compactness and the informativeness of the bottleneck variable, the IB optimization problem (2) was proposed [tishby2000information]. Since any randomized function $T=F(X)$ can be equivalently characterized by a conditional distribution, (2) can instead be expressed as

$$\mathsf{IB}(P_{XY},R)\coloneqq\sup_{\substack{P_{T|X}:\,Y - X - T\\ I(X;T)\leq R}}I(Y;T),\qquad\text{or}\qquad\widetilde{\mathsf{IB}}(P_{XY},\tilde{R})\coloneqq\inf_{\substack{P_{T|X}:\,Y - X - T\\ I(Y;T)\geq\tilde{R}}}I(X;T), \qquad (4)$$

where $R$ and $\tilde{R}$ denote the level of desired compression and informativeness, respectively. We use $\mathsf{IB}(R)$ and $\widetilde{\mathsf{IB}}(\tilde{R})$ to denote $\mathsf{IB}(P_{XY},R)$ and $\widetilde{\mathsf{IB}}(P_{XY},\tilde{R})$, respectively, when the joint distribution is clear from the context. Notice that if $\mathsf{IB}(P_{XY},R)=\tilde{R}$, then $\widetilde{\mathsf{IB}}(P_{XY},\tilde{R})=R$.

Now consider the setup where data XX is required to be disclosed while maintaining the privacy of a sensitive attribute, represented by YY. This goal was formulated by 𝖯𝖥\mathsf{PF} in (3). As before, replacing randomized function T=F(X)T=F(X) with conditional distribution PT|XP_{T|X}, we can equivalently express (3) as

$$\mathsf{PF}(P_{XY},r)\coloneqq\inf_{\substack{P_{T|X}:\,Y - X - T\\ I(X;T)\geq r}}I(Y;T),\qquad\text{or}\qquad\widetilde{\mathsf{PF}}(P_{XY},\tilde{r})\coloneqq\sup_{\substack{P_{T|X}:\,Y - X - T\\ I(Y;T)\leq\tilde{r}}}I(X;T), \qquad (5)$$

where r~\tilde{r} and rr denote the level of desired privacy and informativeness, respectively. The case r~=0\tilde{r}=0 is particularly interesting in practice and specifies perfect privacy, see e.g., [Calmon_fundamental-Limit, Rassouli_Perfect]. As before, we write 𝖯𝖥~(r~)\widetilde{\mathsf{PF}}(\tilde{r}) and 𝖯𝖥(r)\mathsf{PF}(r) for 𝖯𝖥~(PXY,r~)\widetilde{\mathsf{PF}}(P_{XY},\tilde{r}) and 𝖯𝖥(PXY,r)\mathsf{PF}(P_{XY},r) when PXYP_{XY} is clear from the context.

Figure 1: Examples of the set ${\mathcal{M}}$ defined in (6), for (a) $|{\mathcal{X}}|=|{\mathcal{Y}}|=2$, (b) $|{\mathcal{X}}|=3,|{\mathcal{Y}}|=2$, and (c) $|{\mathcal{X}}|=5,|{\mathcal{Y}}|=2$. The upper and lower boundaries of this set correspond to IB and PF, respectively. It is worth noting that, while $\mathsf{IB}(R)=0$ only at $R=0$, $\mathsf{PF}(r)=0$ can hold for $r$ in a non-trivial interval (here, only for $|{\mathcal{X}}|>2$). Also, in general neither the upper nor the lower boundary is smooth. A sufficient condition for smoothness is $P_{Y|X}(y|x)>0$ (see Theorem 1); thus both IB and PF are smooth in the binary case.

The following properties of 𝖨𝖡\mathsf{IB} and 𝖯𝖥\mathsf{PF} follow directly from their definitions. The proof of this result (and any other results in this section) is given in Appendix LABEL:Appendix_ProofSecIB_PF.

Theorem 1.

For a given PXYP_{XY}, the mappings 𝖨𝖡(R)\mathsf{IB}(R) and 𝖯𝖥(r)\mathsf{PF}(r) have the following properties:

  • 𝖨𝖡(0)=𝖯𝖥(0)=0\mathsf{IB}(0)=\mathsf{PF}(0)=0.

  • 𝖨𝖡(R)=I(X;Y)\mathsf{IB}(R)=I(X;Y) for any RH(X)R\geq H(X) and 𝖯𝖥(r)=I(X;Y)\mathsf{PF}(r)=I(X;Y) for rH(X)r\geq H(X).

  • 0𝖨𝖡(R)min{R,I(X;Y)}0\leq\mathsf{IB}(R)\leq\min\{R,I(X;Y)\} for any R0R\geq 0 and 𝖯𝖥(r)max{rH(X|Y),0}\mathsf{PF}(r)\geq\max\{r-H(X|Y),0\} for any r0r\geq 0.

  • R𝖨𝖡(R)R\mapsto\mathsf{IB}(R) is continuous, strictly increasing, and concave on the range (0,I(X;Y))(0,I(X;Y)).

  • r𝖯𝖥(r)r\mapsto\mathsf{PF}(r) is continuous, strictly increasing, and convex on the range (0,I(X;Y))(0,I(X;Y)).

  • If PY|X(y|x)>0P_{Y|X}(y|x)>0 for all x𝒳x\in{\mathcal{X}} and y𝒴y\in{\mathcal{Y}}, then both R𝖨𝖡(R)R\mapsto\mathsf{IB}(R) and r𝖯𝖥(r)r\mapsto\mathsf{PF}(r) are continuously differentiable over (0,H(X))(0,H(X)).

  • R𝖨𝖡(R)RR\mapsto\frac{\mathsf{IB}(R)}{R} is non-increasing and r𝖯𝖥(r)rr\mapsto\frac{\mathsf{PF}(r)}{r} is non-decreasing.

  • We have

    $$\mathsf{IB}(R)=\sup_{\substack{P_{T|X}:\,Y - X - T\\ I(X;T)=R}}I(Y;T),\qquad\text{and}\qquad\mathsf{PF}(r)=\inf_{\substack{P_{T|X}:\,Y - X - T\\ I(X;T)=r}}I(Y;T).$$

According to this theorem, we can always restrict both RR and rr in (4) and (5), respectively, to [0,H(X)][0,H(X)] as 𝖨𝖡(R)=𝖯𝖥(r)=I(X;Y)\mathsf{IB}(R)=\mathsf{PF}(r)=I(X;Y) for all r,RH(X)r,R\geq H(X).

Define ${\mathcal{M}}={\mathcal{M}}(P_{XY})\subset\mathbb{R}^{2}$ as

$${\mathcal{M}}\coloneqq\big\{(I(X;T),I(Y;T)):Y - X - T,\ (X,Y)\sim P_{XY}\big\}. \qquad (6)$$

It can be directly verified that ${\mathcal{M}}$ is convex. According to Theorem 1, $R\mapsto\mathsf{IB}(R)$ and $r\mapsto\mathsf{PF}(r)$ correspond to the upper and lower boundaries of ${\mathcal{M}}$, respectively. The convexity of ${\mathcal{M}}$ then implies the concavity and convexity of IB and PF. Fig. 1 illustrates the set ${\mathcal{M}}$ for binary $Y$ and several alphabet sizes of $X$.
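To build intuition for ${\mathcal{M}}$, the following sketch (ours; not part of the paper) samples random channels $P_{T|X}$ for a small discrete $P_{XY}$ and scatters the resulting pairs $(I(X;T),I(Y;T))$; the upper and lower envelopes of the point cloud approximate $\mathsf{IB}$ and $\mathsf{PF}$.

```python
# Visualize the set M in (6) for a small discrete P_XY by sampling random channels.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
P_XY = np.array([[0.4, 0.1, 0.1],
                 [0.05, 0.05, 0.3]]).T          # |X| = 3, |Y| = 2
P_X = P_XY.sum(axis=1)

def MI(P):                                      # mutual information of a joint pmf
    Pa, Pb = P.sum(1, keepdims=True), P.sum(0, keepdims=True)
    m = P > 0
    return float(np.sum(P[m] * np.log(P[m] / (Pa @ Pb)[m])))

pts = []
for _ in range(5000):
    k = rng.integers(2, 5)                            # |T| up to |X| + 1
    P_T_given_X = rng.dirichlet(np.ones(k), size=3)   # one random row per x
    pts.append((MI(P_X[:, None] * P_T_given_X),       # I(X;T)
                MI(P_XY.T @ P_T_given_X)))            # I(Y;T), by Y - X - T

pts = np.array(pts)
plt.scatter(pts[:, 0], pts[:, 1], s=1)
plt.xlabel("I(X;T)"); plt.ylabel("I(Y;T)"); plt.show()
```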

While both 𝖨𝖡(0)=0\mathsf{IB}(0)=0 and 𝖯𝖥(0)=0\mathsf{PF}(0)=0, their behavior in the neighborhood around zero might be completely different. As illustrated in Fig. 1, 𝖨𝖡(R)>0\mathsf{IB}(R)>0 for all R>0R>0, whereas 𝖯𝖥(r)=0\mathsf{PF}(r)=0 for r[0,r0]r\in[0,r_{0}] for some r0>0r_{0}>0. When such r0>0r_{0}>0 exists, we say perfect privacy occurs: there exists a variable TT satisfying YXTY\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}X\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}T such that I(Y;T)=0I(Y;T)=0 while I(X;T)>0I(X;T)>0; making TT a representation of XX having perfect privacy (i.e., no information leakage about YY). A necessary and sufficient condition for the existence of such TT is given in [Asoode_submitted, Lemma 10] and [Calmon_fundamental-Limit, Theorem 3], described next.

Theorem 2 (Perfect privacy).

Let $(X,Y)\sim P_{XY}$ be given and let ${\mathcal{A}}\subset[0,1]^{|{\mathcal{Y}}|}$ be the set of vectors $\{P_{Y|X}(\cdot|x),x\in{\mathcal{X}}\}$. Then there exists $r_{0}>0$ such that $\mathsf{PF}(r)=0$ for $r\in[0,r_{0}]$ if and only if the vectors in ${\mathcal{A}}$ are linearly dependent.

In light of this theorem, perfect privacy occurs whenever $|{\mathcal{X}}|>|{\mathcal{Y}}|$. It also follows from the theorem that for binary $X$, perfect privacy cannot occur (see Fig. 1(a)).

Theorem 1 enables us to derive simple bounds for IB and PF. Specifically, the facts that $\frac{\mathsf{PF}(r)}{r}$ is non-decreasing and $\frac{\mathsf{IB}(R)}{R}$ is non-increasing immediately result in the following linear bounds.

Theorem 3 (Linear lower bound).

For r,R(0,H(X))r,R\in(0,H(X)), we have

$$\inf_{\substack{Q_{X}\in{\mathcal{P}}({\mathcal{X}})\\ Q_{X}\neq P_{X}}}\frac{D_{\mathsf{KL}}(Q_{Y}\|P_{Y})}{D_{\mathsf{KL}}(Q_{X}\|P_{X})}\leq\frac{\mathsf{PF}(r)}{r}\leq\frac{I(X;Y)}{H(X)}\leq\frac{\mathsf{IB}(R)}{R}\leq\sup_{\substack{Q_{X}\in{\mathcal{P}}({\mathcal{X}})\\ Q_{X}\neq P_{X}}}\frac{D_{\mathsf{KL}}(Q_{Y}\|P_{Y})}{D_{\mathsf{KL}}(Q_{X}\|P_{X})}\leq 1. \qquad (7)$$

In light of this theorem, if $\mathsf{PF}(r)=r$, then $I(X;Y)=H(X)$, implying $X=g(Y)$ for a deterministic function $g$. Conversely, if $X=g(Y)$ then $\mathsf{PF}(r)=r$ because for all $T$ forming the Markov relation $Y - g(Y) - T$, we have $I(Y;T)=I(g(Y);T)$. On the other hand, we have $\mathsf{IB}(R)=R$ if and only if there exists a variable $T^{*}$ satisfying $I(X;T^{*})=I(Y;T^{*})$ and thus the following double Markov relations

$$Y - X - T^{*},\qquad\text{and}\qquad X - Y - T^{*}.$$

It can be verified (see [csiszarbook, Problem 16.25]) that this double Markov condition is equivalent to the existence of a pair of functions $f$ and $g$ such that $f(X)=g(Y)$ and $(X,Y) - f(X) - T^{*}$. One special case of this setting, namely where $g$ is the identity function, has recently been studied in detail in [kolchinsky2018caveats] and will be reviewed in Section II-E. Theorem 3 also enables us to characterize the "worst" joint distribution $P_{XY}$ with respect to IB and PF. As demonstrated in the following lemma, if $P_{Y|X}$ is an erasure channel then $\frac{\mathsf{PF}(r)}{r}=\frac{\mathsf{IB}(R)}{R}=\frac{I(X;Y)}{H(X)}$.

Lemma 1.

  • Let $P_{XY}$ be such that ${\mathcal{Y}}={\mathcal{X}}\cup\{\perp\}$, $P_{Y|X}(x|x)=1-\delta$, and $P_{Y|X}(\perp|x)=\delta$ for some $\delta>0$. Then

    $$\frac{\mathsf{PF}(r)}{r}=\frac{\mathsf{IB}(R)}{R}=1-\delta.$$

  • Let $P_{XY}$ be such that ${\mathcal{X}}={\mathcal{Y}}\cup\{\perp\}$, $P_{X|Y}(y|y)=1-\delta$, and $P_{X|Y}(\perp|y)=\delta$ for some $\delta>0$. Then

    $$\mathsf{PF}(r)=\max\{r-H(X|Y),0\}.$$
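For the first bullet, the following short calculation (our sketch of the key step, not reproduced from the appendix) explains why the ratio is exactly $1-\delta$. Let $E\coloneqq\mathbf{1}\{Y=\perp\}$ denote the erasure event; $E$ is a function of $Y$ and is independent of $(X,T)$ for every $T$ with $Y - X - T$. Hence

$$I(Y;T)=I(Y,E;T)=I(E;T)+I(Y;T\mid E)=(1-\delta)\,I(X;T),$$

since $I(E;T)=0$, $I(Y;T\mid E=0)=I(X;T)$, and $I(Y;T\mid E=1)=0$. Optimizing over $T$ subject to $I(X;T)\leq R$ (resp. $I(X;T)\geq r$) then gives $\frac{\mathsf{IB}(R)}{R}=\frac{\mathsf{PF}(r)}{r}=1-\delta$.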

The bounds in Theorem 3 hold for all rr and RR in the interval [0,H(X)][0,H(X)]. We can, however, improve them when rr and RR are sufficiently small. Let 𝖯𝖥(0)\mathsf{PF}^{\prime}(0) and 𝖨𝖡(0)\mathsf{IB}^{\prime}(0) denote the slope of 𝖯𝖥()\mathsf{PF}(\cdot) and 𝖨𝖡()\mathsf{IB}(\cdot) at zero, i.e., 𝖯𝖥(0)limr0+𝖯𝖥(r)r\mathsf{PF}^{\prime}(0)\coloneqq\lim_{r\to 0^{+}}\frac{\mathsf{PF}(r)}{r} and 𝖨𝖡(0)limR0+𝖨𝖡(R)R\mathsf{IB}^{\prime}(0)\coloneqq\lim_{R\to 0^{+}}\frac{\mathsf{IB}(R)}{R}.

Theorem 4.

Given (X,Y)PXY(X,Y)\sim P_{XY}, we have

$$\inf_{\substack{Q_{X}\in{\mathcal{P}}({\mathcal{X}})\\ Q_{X}\neq P_{X}}}\frac{D_{\mathsf{KL}}(Q_{Y}\|P_{Y})}{D_{\mathsf{KL}}(Q_{X}\|P_{X})}=\mathsf{PF}^{\prime}(0)\leq\min_{\substack{x\in{\mathcal{X}}:\\ P_{X}(x)>0}}\frac{D_{\mathsf{KL}}(P_{Y|X}(\cdot|x)\|P_{Y}(\cdot))}{-\log P_{X}(x)}\leq\max_{\substack{x\in{\mathcal{X}}:\\ P_{X}(x)>0}}\frac{D_{\mathsf{KL}}(P_{Y|X}(\cdot|x)\|P_{Y}(\cdot))}{-\log P_{X}(x)}\leq\mathsf{IB}^{\prime}(0)=\sup_{\substack{Q_{X}\in{\mathcal{P}}({\mathcal{X}})\\ Q_{X}\neq P_{X}}}\frac{D_{\mathsf{KL}}(Q_{Y}\|P_{Y})}{D_{\mathsf{KL}}(Q_{X}\|P_{X})}.$$

This theorem provides the exact values of $\mathsf{PF}^{\prime}(0)$ and $\mathsf{IB}^{\prime}(0)$, as well as simple bounds for them. Although the exact expressions for $\mathsf{PF}^{\prime}(0)$ and $\mathsf{IB}^{\prime}(0)$ are usually difficult to compute, a simple plug-in estimator was proposed in [Hypercontractivity_NIPS2017] for $\mathsf{IB}^{\prime}(0)$. This estimator can be readily adapted to estimate $\mathsf{PF}^{\prime}(0)$. Theorem 4 reveals a profound connection between IB and the strong data processing inequality (SDPI) [Ahlswede_Gacs]. More precisely, thanks to the pioneering work of Anantharam et al. [anantharam], it is known that the supremum of $\frac{D_{\mathsf{KL}}(Q_{Y}\|P_{Y})}{D_{\mathsf{KL}}(Q_{X}\|P_{X})}$ over all $Q_{X}\neq P_{X}$ is equal to the supremum of $\frac{I(Y;T)}{I(X;T)}$ over all $P_{T|X}$ satisfying $Y - X - T$, and hence $\mathsf{IB}^{\prime}(0)$ specifies the strengthening of the data processing inequality of mutual information. This connection may open a new avenue for theoretical results on IB, especially when $X$ or $Y$ are continuous random variables. In particular, the recent non-multiplicative SDPI results [Polyanskiy, calmon2015strong] seem insightful for this purpose.

In many practical cases, we might have nn i.i.d. samples (X1,Y1),,(Xn,Yn)(X_{1},Y_{1}),\dots,(X_{n},Y_{n}) of (X,Y)PXY(X,Y)\sim P_{XY}. We now study how 𝖨𝖡\mathsf{IB} behaves in nn. Let Xn(X1,,Xn)X^{n}\coloneqq(X_{1},\dots,X_{n}) and Yn(Y1,,Yn)Y^{n}\coloneqq(Y_{1},\dots,Y_{n}). Due to the i.i.d. assumption, we have PXnYn(xn,yn)=i=1nPXY(xi,yi)P_{X^{n}Y^{n}}(x^{n},y^{n})=\prod_{i=1}^{n}P_{XY}(x_{i},y_{i}). This can also be described by independently feeding XiX_{i}, i[n]i\in[n], to channel PY|XP_{Y|X} producing YiY_{i}. The following theorem, demonstrated first in [Witsenhausen_Wyner, Theorem 2.4], gives a formula for 𝖨𝖡\mathsf{IB} in terms of nn.

Theorem 5 (Additivity).

We have

1n𝖨𝖡(PXnYn,nR)=𝖨𝖡(PXY,R).\frac{1}{n}\mathsf{IB}(P_{X^{n}Y^{n}},nR)=\mathsf{IB}(P_{XY},R).

This theorem demonstrates that an optimal channel PTn|XnP_{T^{n}|X^{n}} for i.i.d. samples (Xn,Yn)PXY(X^{n},Y^{n})\sim P_{XY} is obtained by the Kronecker product of an optimal channel PT|XP_{T|X} for (X,Y)PXY(X,Y)\sim P_{XY}. This, however, may not hold in general for 𝖯𝖥\mathsf{PF}, that is, we might have 𝖯𝖥(PXnYn,nr)<n𝖯𝖥(PXY,r)\mathsf{PF}(P_{X^{n}Y^{n}},nr)<n\mathsf{PF}(P_{XY},r), see [Calmon_fundamental-Limit, Proposition 1] for an example.

II-A Gaussian 𝖨𝖡\mathsf{IB} and 𝖯𝖥\mathsf{PF}

In this section, we turn our attention to a special, yet important, case where X=Y+σN𝖦X=Y+\sigma N^{\mathsf{G}}, where σ>0\sigma>0 and N𝖦𝒩(0,1)N^{\mathsf{G}}\sim{\mathcal{N}}(0,1) is independent of YY. This setting subsumes the popular case of jointly Gaussian (X,Y)(X,Y) whose information bottleneck functional was computed in [Bottleneck_Gaussian] for the vector case (i.e., (X,Y)(X,Y) are jointly Gaussian random vectors).

Lemma 2.

Let $\{Y_{i}\}_{i=1}^{n}$ be $n$ i.i.d. copies of $Y\sim P_{Y}$ and $X_{i}=Y_{i}+\sigma N_{i}^{\mathsf{G}}$, where $\{N^{\mathsf{G}}_{i}\}$ are i.i.d. samples of ${\mathcal{N}}(0,1)$ independent of $Y$. Then, we have

$$\frac{1}{n}\mathsf{IB}(P_{X^{n}Y^{n}},nR)\leq H(X)-\frac{1}{2}\log\left[2\pi e\sigma^{2}+e^{2(H(Y)-R)}\right].$$

It is worth noting that this result was concurrently proved in [Zaidi_GaussianCase]. The main technical tool in the proof of this lemma is a strong version of the entropy power inequality [EPI_Courtade, Theorem 2], which holds even if $X_{i}$, $Y_{i}$, and $N_{i}$ are random vectors (as opposed to scalars). Thus, one can readily generalize Lemma 2 to the vector case. Note that the upper bound established in this lemma holds without any assumptions on $P_{T|X}$. This upper bound provides a significantly simpler proof of the well-known fact that for jointly Gaussian $(X,Y)$, the optimal channel $P_{T|X}$ is Gaussian. This result was first proved in [GaussianIB] and used in [Bottleneck_Gaussian] to compute an expression of IB for the Gaussian case.
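To see concretely how the bound in Lemma 2 collapses to the Gaussian expression below, consider the unit-variance case $Y\sim{\mathcal{N}}(0,1)$ and $X=Y+\sigma N^{\mathsf{G}}$ (a worked check of ours, using only the standard Gaussian entropies), so that $\rho^{2}=\frac{1}{1+\sigma^{2}}$, $H(Y)=\frac{1}{2}\log(2\pi e)$, and $H(X)=\frac{1}{2}\log\big(2\pi e(1+\sigma^{2})\big)$. Substituting into Lemma 2 gives

$$H(X)-\frac{1}{2}\log\!\left[2\pi e\sigma^{2}+e^{2(H(Y)-R)}\right]=\frac{1}{2}\log\frac{1+\sigma^{2}}{\sigma^{2}+e^{-2R}}=\frac{1}{2}\log\frac{1}{1-\rho^{2}+\rho^{2}e^{-2R}},$$

which is exactly the expression (8) in Corollary 1 below; hence the bound is achieved in this special case.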

Corollary 1.

If (X,Y)(X,Y) are jointly Gaussian with correlation coefficient ρ\rho, then we have

𝖨𝖡(R)=12log11ρ2+ρ2e2R.\mathsf{IB}(R)=\frac{1}{2}\log\frac{1}{1-\rho^{2}+\rho^{2}e^{-2R}}. (8)

Moreover, the optimal channel PT|XP_{T|X} is given by PT|X(|x)=𝒩(0,σ~2)P_{T|X}(\cdot|x)={\mathcal{N}}(0,\tilde{\sigma}^{2}) for σ~2=σY2e2Rρ2(1e2R)\tilde{\sigma}^{2}=\sigma_{Y}^{2}\frac{e^{-2R}}{\rho^{2}(1-e^{-2R})} where σY2\sigma_{Y}^{2} is the variance of YY.

In Lemma 2, we assumed that XX is a Gaussian perturbation of YY. However, in some practical scenarios, we might have YY as a Gaussian perturbation of XX. For instance, let XX represent an image and YY be a feature of the image that can be perfectly obtained from a noisy observation of XX. Then, the goal is to compress the image with a given compression rate while retaining maximal information about the feature. The following lemma, which is an immediate consequence of [calmon2015strong, Theorem 1], gives an upper bound for 𝖨𝖡\mathsf{IB} in this case.

Lemma 3.

Let XnX^{n} be nn i.i.d. copies of a random variable XX satisfying 𝔼[X2]1{\mathbb{E}}[X^{2}]\leq 1 and YiY_{i} be the result of passing XiX_{i}, i[n]i\in[n], through a Gaussian channel Y=X+σN𝖦Y=X+\sigma N^{\mathsf{G}}, where σ>0\sigma>0 and N𝖦𝒩(0,1)N^{\mathsf{G}}\sim{\mathcal{N}}(0,1) is independent of XX. Then, we have

1n𝖨𝖡(PXnYn,nR)RΨ(R,σ),\frac{1}{n}\mathsf{IB}(P_{X^{n}Y^{n}},nR)\leq R-\Psi(R,\sigma), (9)

where

$$\Psi(R,\sigma)\coloneqq\max_{x\in[0,\frac{1}{2}]}2\mathsf{Q}\left(\sqrt{\frac{1}{x\sigma^{2}}}\right)\left(R-h_{\mathsf{b}}(x)-\frac{x}{2}\log\left(1+\frac{1}{x\sigma^{2}}\right)\right), \qquad (10)$$

$\mathsf{Q}(t)\coloneqq\int_{t}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-\frac{u^{2}}{2}}\,\text{d}u$ is the Gaussian complementary CDF, and $h_{\mathsf{b}}(a)\coloneqq-a\log(a)-(1-a)\log(1-a)$ for $a\in(0,1)$ is the binary entropy function. Moreover, we have

$$\frac{1}{n}\mathsf{IB}(P_{X^{n}Y^{n}},nR)\leq R-e^{-\frac{1}{R\sigma^{2}}\log\frac{1}{R}+\Theta\left(\log\frac{1}{R}\right)}. \qquad (11)$$

Note that Lemma 3 holds for an arbitrary $X$ (provided that ${\mathbb{E}}[X^{2}]\leq 1$), and hence (9) bounds the information bottleneck functional for a wide family of $P_{XY}$. However, the bound is loose in general for large values of $R$. For instance, if $(X,Y)$ are jointly Gaussian (implying $Y=X+\sigma N^{\mathsf{G}}$ for some $\sigma>0$), then the right-hand side of (9) does not reduce to (8). To illustrate this, we numerically compute the upper bound (9) and compare it with the Gaussian information bottleneck (8) in Fig. 2.
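The following snippet (our sketch of the comparison behind Fig. 2; parameter values are assumptions matching the caption) evaluates the bound (9)-(10) on a grid and compares it with the exact Gaussian value (8), in nats, for $\sigma^{2}=0.5$.

```python
# Evaluate the upper bound (9)-(10) and compare with the exact Gaussian expression (8).
import numpy as np
from scipy.stats import norm

def h_b(x):
    return -x*np.log(x) - (1-x)*np.log(1-x)

def Psi(R, sigma, grid=np.linspace(1e-4, 0.5, 2000)):
    vals = 2*norm.sf(np.sqrt(1/(grid*sigma**2))) * \
           (R - h_b(grid) - 0.5*grid*np.log(1 + 1/(grid*sigma**2)))
    return max(vals.max(), 0.0)            # the maximand vanishes as x -> 0

sigma = np.sqrt(0.5)
rho2 = 1/(1 + sigma**2)                    # Y = X + sigma*N with X ~ N(0,1)
for R in [0.25, 0.5, 1.0, 2.0]:
    gauss_ib = 0.5*np.log(1/(1 - rho2 + rho2*np.exp(-2*R)))   # exact value (8)
    upper = R - Psi(R, sigma)                                  # bound (9)-(10)
    print(R, round(gauss_ib, 3), round(upper, 3))
```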

Figure 2: Comparison of (8), the exact value of IB for jointly Gaussian $X$ and $Y$ (i.e., $Y=X+\sigma N^{\mathsf{G}}$ with $X$ and $N^{\mathsf{G}}$ both standard Gaussian ${\mathcal{N}}(0,1)$), with the general upper bound (9) for $\sigma^{2}=0.5$. It is worth noting that while the Gaussian IB converges to $I(X;Y)\approx 0.8$, the upper bound diverges.

The privacy funnel functional is much less studied even for the simple case of jointly Gaussian. Solving the optimization in 𝖯𝖥\mathsf{PF} over PT|XP_{T|X} without any assumptions is a difficult challenge. A natural assumption to make is that PT|X(|x)P_{T|X}(\cdot|x) is Gaussian for each x𝒳x\in{\mathcal{X}}. This leads to the following variant of 𝖯𝖥\mathsf{PF}

𝖯𝖥𝖦(r)infσ0,I(X;Tσ)rI(Y;Tσ),\mathsf{PF}^{\mathsf{G}}(r)\coloneqq\inf_{\begin{subarray}{c}\sigma\geq 0,\\ I(X;T_{\sigma})\geq r\end{subarray}}I(Y;T_{\sigma}),

where

TσX+σN𝖦,T_{\sigma}\coloneqq X+\sigma N^{\mathsf{G}},

and N𝖦𝒩(0,1)N^{\mathsf{G}}\sim{\mathcal{N}}(0,1) is independent of XX. This formulation is tractable and can be computed in closed form for jointly Gaussian (X,Y)(X,Y) as described in the following example.

Example 1. Let XX and YY be jointly Gaussian with correlation coefficient ρ\rho. First note that since mutual information is invariant to scaling, we may assume without loss of generality that both XX and YY are zero mean and unit variance and hence we can write X=ρY+1ρ2M𝖦X=\rho Y+\sqrt{1-\rho^{2}}M^{\mathsf{G}} where M𝖦𝒩(0,1)M^{\mathsf{G}}\sim{\mathcal{N}}(0,1) is independent of YY. Consequently, we have

I(X;Tσ)=12log(1+1σ2),I(X;T_{\sigma})=\frac{1}{2}\log\left(1+\frac{1}{\sigma^{2}}\right), (12)

and

I(Y;Tσ)=12log(1+ρ21ρ2+σ2).I(Y;T_{\sigma})=\frac{1}{2}\log\left(1+\frac{\rho^{2}}{1-\rho^{2}+\sigma^{2}}\right). (13)

In order to ensure I(X;Tσ)rI(X;T_{\sigma})\geq r, we must have σ(e2r1)12\sigma\leq\left(e^{2r}-1\right)^{-\frac{1}{2}}. Plugging this choice of σ\sigma into (13), we obtain

𝖯𝖥𝖦(r)=12log(11ρ2(1e2r)).\mathsf{PF}^{\mathsf{G}}(r)=\frac{1}{2}\log\left(\frac{1}{1-\rho^{2}\left(1-e^{-2r}\right)}\right). (14)

This example indicates that for jointly Gaussian (X,Y)(X,Y), we have 𝖯𝖥𝖦(r)=0\mathsf{PF}^{\mathsf{G}}(r)=0 if and only if r=0r=0 (thus perfect privacy does not occur) and the constraint I(X;Tσ)=rI(X;T_{\sigma})=r is satisfied by a unique σ\sigma. These two properties in fact hold for all continuous variables XX and YY with finite second moments as demonstrated in Lemma LABEL:Lemma:StrictCon_IYT in Appendix LABEL:Appendix_ProofSecIB_PF. We use these properties to derive a second-order approximation of 𝖯𝖥𝖦(r)\mathsf{PF}^{\mathsf{G}}(r) when rr is sufficiently small. For the following theorem, we use 𝗏𝖺𝗋(U){\mathsf{var}}(U) to denote the variance of the random variable UU and 𝗏𝖺𝗋(U|V)𝔼[(U𝔼[U|V])2|V]{\mathsf{var}}(U|V)\coloneqq{\mathbb{E}}[(U-{\mathbb{E}}[U|V])^{2}|V]. We use σ2X=𝗏𝖺𝗋(X)\sigma^{2}_{X}={\mathsf{var}}(X) for short.

Theorem 6.

For any pair of continuous random variables (X,Y)(X,Y) with finite second moments, we have as r0r\to 0

$$\mathsf{PF}^{\mathsf{G}}(r)=\eta(X,Y)\,r+\Delta(X,Y)\,r^{2}+o(r^{2}),$$

where $\eta(X,Y)\coloneqq\frac{{\mathsf{var}}({\mathbb{E}}[X|Y])}{\sigma_{X}^{2}}$ and

$$\Delta(X,Y)\coloneqq\frac{2}{\sigma^{4}_{X}}\left[{\mathbb{E}}[{\mathsf{var}}^{2}(X|Y)]-\sigma_{X}^{2}\,{\mathbb{E}}[{\mathsf{var}}(X|Y)]\right].$$

It is worth mentioning that the quantity $\eta(X,Y)$ was first defined by Rényi [Renyi-dependence-measure] as an asymmetric measure of correlation between $X$ and $Y$. In fact, it can be shown that $\eta(X,Y)=\sup_{f}\rho^{2}(X,f(Y))$, where the supremum is taken over all measurable functions $f$ and $\rho(\cdot,\cdot)$ denotes the correlation coefficient. As a simple illustration of Theorem 6, consider jointly Gaussian $X$ and $Y$ with correlation coefficient $\rho$, for which $\mathsf{PF}^{\mathsf{G}}$ was computed in Example 1. In this case, it can be easily verified that $\eta(X,Y)=\rho^{2}$ and $\Delta(X,Y)=-2\sigma_{X}^{2}\rho^{2}(1-\rho^{2})$. Hence, for jointly Gaussian $(X,Y)$ with correlation coefficient $\rho$ and unit variance, we have $\mathsf{PF}^{\mathsf{G}}(r)=\rho^{2}r-2\rho^{2}(1-\rho^{2})r^{2}+o(r^{2})$. In Fig. 3, we compare this approximation with the exact expression (14) for this particular case.
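As a quick numerical check of the leading-order behavior (ours; it only verifies the slope $\eta(X,Y)=\rho^{2}$ against the exact expression (14), in nats):

```python
# Check that PF^G(r)/r approaches eta(X,Y) = rho^2 as r -> 0 for jointly Gaussian
# (X,Y) with correlation coefficient rho, using the closed form (14).
import numpy as np

rho = 0.8
pf_gauss = lambda r: 0.5*np.log(1/(1 - rho**2*(1 - np.exp(-2*r))))   # expression (14)

for r in [0.1, 0.01, 0.001, 0.0001]:
    print(r, pf_gauss(r)/r)            # tends to rho^2 = 0.64
```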

Figure 3: Second-order approximation of $\mathsf{PF}^{\mathsf{G}}$ according to Theorem 6 for jointly Gaussian $X$ and $Y$ with correlation coefficient $\rho=0.8$. For this particular case, the exact expression of $\mathsf{PF}^{\mathsf{G}}$ is computed in (14).

II-B Evaluation of 𝖨𝖡\mathsf{IB} and 𝖯𝖥\mathsf{PF}

The constrained optimization problems in the definitions of 𝖨𝖡\mathsf{IB} and 𝖯𝖥\mathsf{PF} are usually challenging to solve numerically due to the non-linearity in the constraints. In practice, however, both 𝖨𝖡\mathsf{IB} and 𝖯𝖥\mathsf{PF} are often approximated by their corresponding Lagrangian optimizations

$${\mathcal{L}}_{\mathsf{IB}}(\beta)\coloneqq\sup_{P_{T|X}}I(Y;T)-\beta I(X;T)=H(Y)-\beta H(X)-\inf_{P_{T|X}}\left[H(Y|T)-\beta H(X|T)\right], \qquad (15)$$

and

$${\mathcal{L}}_{\mathsf{PF}}(\beta)\coloneqq\inf_{P_{T|X}}I(Y;T)-\beta I(X;T)=H(Y)-\beta H(X)-\sup_{P_{T|X}}\left[H(Y|T)-\beta H(X|T)\right], \qquad (16)$$

where $\beta\in\mathbb{R}_{+}$ is the Lagrange multiplier that controls the trade-off between compression and informativeness in IB, and between privacy and informativeness in PF. Notice that for the computation of ${\mathcal{L}}_{\mathsf{IB}}$ we can assume, without loss of generality, that $\beta\in[0,1]$, since otherwise the maximizer of (15) is trivial. It is worth noting that ${\mathcal{L}}_{\mathsf{IB}}(\beta)$ and ${\mathcal{L}}_{\mathsf{PF}}(\beta)$ in fact correspond to lines of slope $\beta$ supporting ${\mathcal{M}}$ from above and below, thereby providing a new representation of ${\mathcal{M}}$.

Let (X,Y)(X^{\prime},Y^{\prime}) be a pair of random variables with XQXX^{\prime}\sim Q_{X} for some QX𝒫(𝒳)Q_{X}\in{\mathcal{P}}({\mathcal{X}}) and YY^{\prime} is the output of PY|XP_{Y|X} when the input is XX^{\prime} (i.e., YQXPY|XY^{\prime}\sim Q_{X}P_{Y|X}). Define

Fβ(QX)H(Y)βH(X).F_{\beta}(Q_{X})\coloneqq H(Y^{\prime})-\beta H(X^{\prime}).

This function is, in general, neither convex nor concave in $Q_{X}$: for instance, $F_{0}$ is concave and $F_{1}$ is convex in $Q_{X}$. The lower convex envelope of $F_{\beta}(Q_{X})$ is defined as the largest convex function smaller than $F_{\beta}(Q_{X})$, and the upper concave envelope as the smallest concave function larger than $F_{\beta}(Q_{X})$. Let $\mathcal{K}_{\mathsf{\cup}}[F_{\beta}(Q_{X})]$ and $\mathcal{K}_{\mathsf{\cap}}[F_{\beta}(Q_{X})]$ denote the lower convex and upper concave envelopes of $F_{\beta}(Q_{X})$, respectively. If $F_{\beta}(Q_{X})$ is convex at $P_{X}$, that is, $\mathcal{K}_{\mathsf{\cup}}[F_{\beta}(Q_{X})]\big|_{P_{X}}=F_{\beta}(P_{X})$, then $F_{\beta^{\prime}}(Q_{X})$ remains convex at $P_{X}$ for all $\beta^{\prime}\geq\beta$ because

𝒦[Fβ(QX)]\displaystyle\mathcal{K}_{\mathsf{\cup}}[F_{\beta^{\prime}}(Q_{X})] =𝒦[Fβ(QX)(ββ)H(X)]\displaystyle=\mathcal{K}_{\mathsf{\cup}}[F_{\beta}(Q_{X})-(\beta^{\prime}-\beta)H(X^{\prime})]
𝒦[Fβ(QX)]+𝒦[(ββ)H(X)]\displaystyle\geq\mathcal{K}_{\mathsf{\cup}}[F_{\beta}(Q_{X})]+\mathcal{K}_{\mathsf{\cup}}[-(\beta^{\prime}-\beta)H(X^{\prime})]
=𝒦[Fβ(QX)](ββ)H(X),\displaystyle=\mathcal{K}_{\mathsf{\cup}}[F_{\beta}(Q_{X})]-(\beta^{\prime}-\beta)H(X^{\prime}),

where the last equality follows from the fact that $-(\beta^{\prime}-\beta)H(X^{\prime})$ is convex in $Q_{X}$. Hence, at $P_{X}$ we have

$$\mathcal{K}_{\mathsf{\cup}}[F_{\beta^{\prime}}(Q_{X})]\big|_{P_{X}}\geq\mathcal{K}_{\mathsf{\cup}}[F_{\beta}(Q_{X})]\big|_{P_{X}}-(\beta^{\prime}-\beta)H(X)=F_{\beta}(P_{X})-(\beta^{\prime}-\beta)H(X)=F_{\beta^{\prime}}(P_{X}).$$

Analogously, if Fβ(QX)F_{\beta}(Q_{X}) is concave at PXP_{X}, that is 𝒦[Fβ(QX)]|PX=Fβ(PX)\mathcal{K}_{\mathsf{\cap}}[F_{\beta}(Q_{X})]\big{|}_{P_{X}}=F_{\beta}(P_{X}), then Fβ(QX)F_{\beta}(Q_{X}) remains concave at PXP_{X} for all ββ\beta^{\prime}\leq\beta.

Notice that, according to (15) and (16), we can write

$${\mathcal{L}}_{\mathsf{IB}}(\beta)=H(Y)-\beta H(X)-\mathcal{K}_{\mathsf{\cup}}[F_{\beta}(Q_{X})]\big|_{P_{X}}, \qquad (17)$$

and

$${\mathcal{L}}_{\mathsf{PF}}(\beta)=H(Y)-\beta H(X)-\mathcal{K}_{\mathsf{\cap}}[F_{\beta}(Q_{X})]\big|_{P_{X}}. \qquad (18)$$

In light of the above arguments, we can write

𝖨𝖡(β)=0,{\mathcal{L}}_{\mathsf{IB}}(\beta)=0,

for all β>β𝖨𝖡\beta>\beta_{\mathsf{IB}} where β𝖨𝖡\beta_{\mathsf{IB}} is the smallest β\beta such that Fβ(PX)F_{\beta}(P_{X}) touches 𝒦[Fβ(QX)]\mathcal{K}_{\mathsf{\cup}}[F_{\beta}(Q_{X})]. Similarly,

𝖯𝖥(β)=0,{\mathcal{L}}_{\mathsf{PF}}(\beta)=0,

for all $\beta<\beta_{\mathsf{PF}}$, where $\beta_{\mathsf{PF}}$ is the largest $\beta$ such that $F_{\beta}(P_{X})$ touches $\mathcal{K}_{\mathsf{\cap}}[F_{\beta}(Q_{X})]$. In the following proposition, we show that $\beta_{\mathsf{IB}}$ and $\beta_{\mathsf{PF}}$ are given by the values of $\mathsf{IB}^{\prime}(0)$ and $\mathsf{PF}^{\prime}(0)$, respectively, given in Theorem 4. Similar formulae for $\beta_{\mathsf{IB}}$ and $\beta_{\mathsf{PF}}$ were given in [Learability_2019].

Proposition 1.

We have

$$\beta_{\mathsf{IB}}=\sup_{Q_{X}\neq P_{X}}\frac{D_{\mathsf{KL}}(Q_{Y}\|P_{Y})}{D_{\mathsf{KL}}(Q_{X}\|P_{X})},$$

and

$$\beta_{\mathsf{PF}}=\inf_{Q_{X}\neq P_{X}}\frac{D_{\mathsf{KL}}(Q_{Y}\|P_{Y})}{D_{\mathsf{KL}}(Q_{X}\|P_{X})}.$$
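As a concrete illustration (our sketch, not from the paper), for the binary symmetric setting studied later in this section one can approximate $\beta_{\mathsf{IB}}$ in Proposition 1 by a direct grid search over $Q_{X}$; the resulting value stays just below $(1-2\delta)^{2}$, consistent with the bound $\beta_{\mathsf{IB}}\leq(1-2\delta)^{2}$ quoted after (26).

```python
# Grid-search estimate of beta_IB = sup_{Q_X != P_X} D(Q_Y||P_Y)/D(Q_X||P_X)
# for P_X = Bernoulli(1/2) and P_{Y|X} = BSC(delta); all divergences in nats.
import numpy as np

def D_bin(a, b):                       # KL divergence D(Bern(a) || Bern(b))
    a = np.clip(a, 1e-12, 1-1e-12); b = np.clip(b, 1e-12, 1-1e-12)
    return a*np.log(a/b) + (1-a)*np.log((1-a)/(1-b))

p, delta = 0.5, 0.1
star = lambda q, d: q*(1-d) + (1-q)*d          # output parameter of BSC(d)

qs = np.linspace(1e-3, 1-1e-3, 5001)
qs = qs[np.abs(qs - p) > 1e-3]                 # exclude Q_X = P_X
ratios = D_bin(star(qs, delta), star(p, delta)) / D_bin(qs, p)

print(ratios.max(), (1-2*delta)**2)            # beta_IB estimate vs (1-2*delta)^2
```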

Kim et al. [Hypercontractivity_NIPS2017] have recently proposed an efficient algorithm to estimate β𝖨𝖡\beta_{\mathsf{IB}} from samples of PXYP_{XY} involving a simple optimization problem. This algorithm can be readily adapted for estimating β𝖯𝖥\beta_{\mathsf{PF}}. Proposition 1 implies that in optimizing the Lagrangians (17) and (18), we can restrict the Lagrange multiplier β\beta, that is

𝖨𝖡(β)=H(Y)βH(X)𝒦[Fβ(QX)]|PX,forβ[0,β𝖨𝖡],{\mathcal{L}}_{\mathsf{IB}}(\beta)=H(Y)-\beta H(X)-\mathcal{K}_{\mathsf{\cup}}[F_{\beta}(Q_{X})]\big{|}_{P_{X}},\qquad\text{for}\qquad\beta\in[0,\beta_{\mathsf{IB}}], (19)

and

𝖯𝖥(β)=H(Y)βH(X)𝒦[Fβ(QX)]|PX,forβ[β𝖯𝖥,).{\mathcal{L}}_{\mathsf{PF}}(\beta)=H(Y)-\beta H(X)-\mathcal{K}_{\mathsf{\cap}}[F_{\beta}(Q_{X})]\big{|}_{P_{X}},\qquad\text{for}\qquad\beta\in[\beta_{\mathsf{PF}},\infty). (20)
Remark 1.

As demonstrated by Kolchinsky et al. [kolchinsky2018caveats], the boundary points 0 and β𝖨𝖡\beta_{\mathsf{IB}} are required for the computation of 𝖨𝖡(β){\mathcal{L}}_{\mathsf{IB}}(\beta). In fact, when YY is a deterministic function of XX, then only β=0\beta=0 and β=β𝖨𝖡\beta=\beta_{\mathsf{IB}} are required to compute the 𝖨𝖡\mathsf{IB} and other values of β\beta are vacuous. The same argument can also be used to justify the inclusion of β𝖯𝖥\beta_{\mathsf{PF}} in computing 𝖯𝖥(β){\mathcal{L}}_{\mathsf{PF}}(\beta). Note also that since Fβ(QX)F_{\beta}(Q_{X}) becomes convex for β>β𝖨𝖡\beta>\beta_{\mathsf{IB}}, computing 𝒦[Fβ(QX)]\mathcal{K}_{\mathsf{\cap}}[F_{\beta}(Q_{X})] becomes trivial for such values of β\beta.

Remark 2.

Observe that the lower convex envelope of any function $f$ can be obtained by taking the Legendre-Fenchel transformation (a.k.a. convex conjugate) twice. Hence, one can use existing linear-time algorithms for approximating the Legendre-Fenchel transformation (e.g., [Legendre_transformation_alg, Legendre_transformation_alg2]) to approximate $\mathcal{K}_{\mathsf{\cup}}[F_{\beta}(Q_{X})]$.
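For a one-dimensional illustration of this biconjugation trick (a sketch of ours on a discretized grid, applied to the binary function $F_{\beta}$ defined in (26) below):

```python
# Lower convex envelope via double (discrete) Legendre-Fenchel transform, applied to
# F_beta(q) = h_b(q*delta) - beta*h_b(q) from (26).
import numpy as np

def h_b(q):
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return -q*np.log(q) - (1-q)*np.log(1-q)

def lower_convex_envelope(q, f, n_slopes=2001):
    """Approximate K_cup[f] on the grid q via the biconjugate f**."""
    smax = np.max(np.abs(np.diff(f) / np.diff(q)))        # slope range to cover
    s = np.linspace(-smax, smax, n_slopes)
    f_star = np.max(s[:, None]*q[None, :] - f[None, :], axis=1)     # conjugate f*
    return np.max(s[None, :]*q[:, None] - f_star[None, :], axis=1)  # biconjugate f**

delta, beta = 0.1, 0.53
q = np.linspace(0, 1, 1001)
F = h_b(q*(1-delta) + (1-q)*delta) - beta*h_b(q)
K = lower_convex_envelope(q, F)

p = 0.5                                     # P_X = Bernoulli(p)
i = np.searchsorted(q, p)
# By (17), this gap equals L_IB(beta), since F_beta(P_X) = H(Y) - beta*H(X).
print(F[i] - K[i])
```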

Once 𝖨𝖡(β){\mathcal{L}}_{\mathsf{IB}}(\beta) and 𝖯𝖥(β){\mathcal{L}}_{\mathsf{PF}}(\beta) are computed, we can derive 𝖨𝖡\mathsf{IB} and 𝖯𝖥\mathsf{PF} via standard results in optimization (see [Witsenhausen_Wyner, Section IV] for more details):

$$\mathsf{IB}(R)=\inf_{\beta\in[0,\beta_{\mathsf{IB}}]}\beta R+{\mathcal{L}}_{\mathsf{IB}}(\beta), \qquad (21)$$

and

$$\mathsf{PF}(r)=\sup_{\beta\in[\beta_{\mathsf{PF}},\infty)}\beta r+{\mathcal{L}}_{\mathsf{PF}}(\beta). \qquad (22)$$

Following the convex-analysis approach outlined by Witsenhausen and Wyner [Witsenhausen_Wyner], IB and PF can be directly computed from ${\mathcal{L}}_{\mathsf{IB}}(\beta)$ and ${\mathcal{L}}_{\mathsf{PF}}(\beta)$ by observing the following. Suppose that for some $\beta$, $\mathcal{K}_{\mathsf{\cup}}[F_{\beta}(Q_{X})]$ (resp. $\mathcal{K}_{\mathsf{\cap}}[F_{\beta}(Q_{X})]$) at $P_{X}$ is obtained by a convex combination of points $F_{\beta}(Q^{i})$, $i\in[k]$, for some $Q^{1},\dots,Q^{k}$ in ${\mathcal{P}}({\mathcal{X}})$, integer $k\geq 2$, and weights $\lambda_{i}\geq 0$ (with $\sum_{i}\lambda_{i}=1$). Then $\sum_{i}\lambda_{i}Q^{i}=P_{X}$, and $T^{*}$ with $P_{T^{*}}(i)=\lambda_{i}$ and $P_{X|T^{*}=i}=Q^{i}$ attains the minimum (resp. maximum) of $H(Y|T)-\beta H(X|T)$. Hence, $(I(X;T^{*}),I(Y;T^{*}))$ is a point on the upper (resp. lower) boundary of ${\mathcal{M}}$, implying that $\mathsf{IB}(R)=I(Y;T^{*})$ for $R=I(X;T^{*})$ (resp. $\mathsf{PF}(r)=I(Y;T^{*})$ for $r=I(X;T^{*})$). If for some $\beta$, $\mathcal{K}_{\mathsf{\cup}}[F_{\beta}(Q_{X})]$ at $P_{X}$ coincides with $F_{\beta}(P_{X})$, then this corresponds to ${\mathcal{L}}_{\mathsf{IB}}(\beta)=0$; the same holds for $\mathcal{K}_{\mathsf{\cap}}[F_{\beta}(Q_{X})]$ and ${\mathcal{L}}_{\mathsf{PF}}(\beta)$. Thus, all the information about the functional IB (resp. PF) is contained in the subset of the domain of $\mathcal{K}_{\mathsf{\cup}}[F_{\beta}(Q_{X})]$ (resp. $\mathcal{K}_{\mathsf{\cap}}[F_{\beta}(Q_{X})]$) over which it differs from $F_{\beta}(Q_{X})$. We will revisit and generalize this approach later in Section III.

We can now instantiate this for the binary symmetric case. Suppose XX and YY are binary variables and PY|XP_{Y|X} is a binary symmetric channel with crossover probability δ\delta, denoted by 𝖡𝖲𝖢(δ)\mathsf{BSC}(\delta) and defined as

𝖡𝖲𝖢(δ)=[1δδδ1δ],\mathsf{BSC}(\delta)=\begin{bmatrix}1-\delta&\delta\\ \delta&1-\delta\end{bmatrix}, (23)

for some δ0\delta\geq 0. To describe the result in a compact fashion, we introduce the following notation: we let h𝖻:[0,1][0,1]h_{\mathsf{b}}:[0,1]\to[0,1] denote the binary entropy function, i.e., h𝖻(p)=plogp(1p)log(1p)h_{\mathsf{b}}(p)=-p\log p-(1-p)\log(1-p). Since this function is strictly increasing on [0,12][0,\frac{1}{2}], its inverse exists and is denoted by h1𝖻:[0,1][0,12]h^{-1}_{\mathsf{b}}:[0,1]\to[0,\frac{1}{2}]. Also, aba(1b)+b(1a)a*b\coloneqq a(1-b)+b(1-a) for a,b[0,1]a,b\in[0,1].

Lemma 4 (Mr. and Mrs. Gerber’s Lemma).

For X𝖡𝖾𝗋𝗇𝗈𝗎𝗅𝗅𝗂(p)X\sim{\mathsf{Bernoulli}}(p) for p12p\leq\frac{1}{2} and PY|X=𝖡𝖲𝖢(δ)P_{Y|X}=\mathsf{BSC}(\delta) for δ0\delta\geq 0, we have

𝖨𝖡(R)=h𝖻(pδ)h𝖻(δh𝖻1(h𝖻(p)R)),\mathsf{IB}(R)=h_{\mathsf{b}}(p*\delta)-h_{\mathsf{b}}\left(\delta*h_{\mathsf{b}}^{-1}\big{(}h_{\mathsf{b}}(p)-R\big{)}\right), (24)

and

𝖯𝖥(r)=h𝖻(pδ)αh𝖻(δpz)α¯h𝖻(δ),\mathsf{PF}(r)=h_{\mathsf{b}}(p*\delta)-\alpha h_{\mathsf{b}}\left(\delta*\frac{p}{z}\right)-\bar{\alpha}h_{\mathsf{b}}\left(\delta\right), (25)

where r=h𝖻(p)αh𝖻(pz)r=h_{\mathsf{b}}(p)-\alpha h_{\mathsf{b}}\left(\frac{p}{z}\right), z=max(α,2p)z=\max\left(\alpha,2p\right), and α[0,1]\alpha\in[0,1].

The result in (24) was proved by Wyner and Ziv [Gerber] and is widely known as Mrs. Gerber’s Lemma in information theory. Due to the similarity, we refer to (25) as Mr. Gerber’s Lemma. As described above, to prove (24) and (25) it suffices to derive the convex and concave envelopes of the mapping Fβ:[0,1]F_{\beta}:[0,1]\to\mathbb{R} given by

Fβ(q)Fβ(QX)=h𝖻(qδ)βh𝖻(q),F_{\beta}(q)\coloneqq F_{\beta}(Q_{X})=h_{\mathsf{b}}(q*\delta)-\beta h_{\mathsf{b}}(q), (26)

where qδqδ¯+δq¯q*\delta\coloneqq q\bar{\delta}+\delta\bar{q} is the output distribution of 𝖡𝖲𝖢(δ)\mathsf{BSC}(\delta) when the input distribution is 𝖡𝖾𝗋𝗇𝗈𝗎𝗅𝗅𝗂(q){\mathsf{Bernoulli}}(q) for some q(0,1)q\in(0,1). It can be verified that β𝖨𝖡(12δ)2\beta_{\mathsf{IB}}\leq(1-2\delta)^{2}. This function is depicted in Fig. 4 for several values of β(12δ)2\beta\leq(1-2\delta)^{2}.

Figure 4: The mapping qFβ(q)=H(Y)βH(X)q\mapsto F_{\beta}(q)=H(Y^{\prime})-\beta H(X^{\prime}) where X𝖡𝖾𝗋𝗇𝗈𝗎𝗅𝗅𝗂(q)X^{\prime}\sim{\mathsf{Bernoulli}}(q) and YY^{\prime} is the result of passing XX^{\prime} through 𝖡𝖲𝖢(0.1)\mathsf{BSC}(0.1), see (26). Panels: (a) β=0.55\beta=0.55, (b) β=0.53\beta=0.53, (c) β=0.49\beta=0.49.
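The closed-form expressions (24) and (25) in Lemma 4 are easy to evaluate numerically. The following Python sketch implements them with entropies measured in bits, consistent with h_b mapping [0,1] to [0,1]; the inverse h_b^{-1} is computed by bisection, and the particular values of p, δ, R and α below are arbitrary illustrations satisfying p ≤ 1/2.

import math

def hb(p):
    # binary entropy in bits
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def hb_inv(y):
    # inverse of hb restricted to [0, 1/2], computed by bisection
    lo, hi = 0.0, 0.5
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if hb(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def star(a, b):
    # binary convolution a*b = a(1-b) + b(1-a)
    return a * (1.0 - b) + b * (1.0 - a)

def IB_bsc(R, p, delta):
    # Mrs. Gerber's Lemma, Eq. (24)
    return hb(star(p, delta)) - hb(star(delta, hb_inv(hb(p) - R)))

def PF_bsc(alpha, p, delta):
    # Mr. Gerber's Lemma, Eq. (25); returns a point (r, PF(r)) parameterized by alpha
    z = max(alpha, 2.0 * p)
    r = hb(p) - alpha * hb(p / z)
    pf = hb(star(p, delta)) - alpha * hb(star(delta, p / z)) - (1.0 - alpha) * hb(delta)
    return r, pf

p, delta = 0.3, 0.1
print(IB_bsc(0.25, p, delta))   # IB at compactness level R = 0.25 bits
print(PF_bsc(0.5, p, delta))    # one point (r, PF(r)) on the privacy funnel curve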

II-C Operational Meaning of 𝖨𝖡\mathsf{IB} and 𝖯𝖥\mathsf{PF}

In this section, we illustrate several information-theoretic settings which shed light on the operational interpretation of both 𝖨𝖡\mathsf{IB} and 𝖯𝖥\mathsf{PF}. The operational interpretation of 𝖨𝖡\mathsf{IB} has recently been extensively studied in information-theoretic settings in [Bottleneck_Polyanskiy, Bottleneck_Shamai]. In particular, it was shown that 𝖨𝖡\mathsf{IB} specifies the rate-distortion region of the noisy source coding problem [Noisy_SourceCoding_Dobrushin, Witsenhusen_Indirect] with logarithmic loss as the distortion measure, as well as the rate region of lossless source coding with side information at the decoder [Wyner_SourceCoding]. Here, we state the former setting (as it will be useful for our subsequent analysis of the cardinality bound) and also provide a new information-theoretic setting in which 𝖨𝖡\mathsf{IB} appears as the solution. Then, we describe another setting, the so-called dependence dilution, whose achievable rate region has an extreme point specified by 𝖯𝖥\mathsf{PF}. This in fact delineates an important difference between 𝖨𝖡\mathsf{IB} and 𝖯𝖥\mathsf{PF}: while 𝖨𝖡\mathsf{IB} describes the entire rate region of an information-theoretic setup, 𝖯𝖥\mathsf{PF} specifies only a corner point of a rate region. Other information-theoretic settings related to 𝖨𝖡\mathsf{IB} and 𝖯𝖥\mathsf{PF} include the CEO problem [Courtade_CEO] and source coding for the Gray-Wyner network [Cheuk_LI_GrayWyner].

II-C1 Noisy Source Coding

Suppose Alice has access only to a noisy version XX of a source of interest YY. She wishes to transmit a rate-constrained description of her observation (i.e., XX) to Bob such that he can recover YY with small average distortion. More precisely, let (Xn,Yn)(X^{n},Y^{n}) be nn i.i.d. samples of (X,Y)PXY(X,Y)\sim P_{XY}. Alice encodes her observation XnX^{n} through an encoder ϕ:𝒳n{1,,Kn}\phi:{\mathcal{X}}^{n}\to\{1,\dots,K_{n}\} and sends ϕ(Xn)\phi(X^{n}) to Bob. Upon receiving ϕ(Xn)\phi(X^{n}), Bob reconstructs a "soft" estimate of YnY^{n} via a decoder ψ:{1,,Kn}𝒴^n\psi:\{1,\dots,K_{n}\}\to\widehat{\mathcal{Y}}^{n} where 𝒴^=𝒫(𝒴)\widehat{\mathcal{Y}}={\mathcal{P}}({\mathcal{Y}}). That is, the reproduction sequence y^n\hat{y}^{n} consists of nn probability measures on 𝒴{\mathcal{Y}}. For any source and reproduction sequences yny^{n} and y^n\hat{y}^{n}, respectively, the distortion is defined as

d(yn,y^n)1ni=1nd(yi,y^i),d(y^{n},\hat{y}^{n})\coloneqq\frac{1}{n}\sum_{i=1}^{n}d(y_{i},\hat{y}_{i}),

where

d(y,y^)log1y^(y).d(y,\hat{y})\coloneqq\log\frac{1}{\hat{y}(y)}. (27)

We say that a pair of rate-distortion (𝖱,𝖣)(\mathsf{R},\mathsf{D}) is achievable if there exists a pair (ϕ,ψ)(\phi,\psi) of encoder and decoder such that

lim supn𝔼[d(Yn,ψ(ϕ(Xn)))]𝖣,andlim supn1nlogKn𝖱.\limsup_{n\to\infty}{\mathbb{E}}[d(Y^{n},\psi(\phi(X^{n})))]\leq\mathsf{D},\qquad\text{and}\qquad\limsup_{n\to\infty}\frac{1}{n}\log K_{n}\leq\mathsf{R}. (28)

The noisy rate-distortion function 𝖱𝗇𝗈𝗂𝗌𝗒(𝖣)\mathsf{R}^{\mathsf{noisy}}(\mathsf{D}), for a given 𝖣0\mathsf{D}\geq 0, is defined as the minimum rate 𝖱\mathsf{R} such that (𝖱,𝖣)(\mathsf{R},\mathsf{D}) is an achievable rate-distortion pair. This problem arises naturally in many data-analytic tasks, including feature selection in high-dimensional datasets, clustering, and matrix completion. It was first studied by Dobrushin and Tsybakov [Noisy_SourceCoding_Dobrushin], who showed that 𝖱𝗇𝗈𝗂𝗌𝗒(𝖣)\mathsf{R}^{\mathsf{noisy}}(\mathsf{D}) admits a characterization analogous to the classical rate-distortion function:

𝖱𝗇𝗈𝗂𝗌𝗒(𝖣)\displaystyle\mathsf{R}^{\mathsf{noisy}}(\mathsf{D}) =infPY^|X:𝔼[d(Y,Y^)]𝖣,YXY^I(X;Y^).\displaystyle=\inf_{\begin{subarray}{c}P_{\hat{Y}|X}:{\mathbb{E}}[d(Y,\hat{Y})]\leq\mathsf{D},\\ Y\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}X\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}\hat{Y}\end{subarray}}I(X;\hat{Y}). (29)

It can be easily verified that 𝔼[d(Y,Y^)]=H(Y|Y^){\mathbb{E}}[d(Y,\hat{Y})]=H(Y|\hat{Y}) and hence (after relabeling Y^\hat{Y} as TT)

𝖱𝗇𝗈𝗂𝗌𝗒(𝖣)=infPT|X:I(Y;T)R,YXTI(X;T),\mathsf{R}^{\mathsf{noisy}}(\mathsf{D})=\inf_{\begin{subarray}{c}P_{T|X}:I(Y;T)\geq R,\\ Y\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}X\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}T\end{subarray}}I(X;T), (30)

where R=H(Y)𝖣R=H(Y)-\mathsf{D}, which is equal to 𝖨𝖡~\widetilde{\mathsf{IB}} defined in (4). For more details on the connection between noisy source coding and 𝖨𝖡\mathsf{IB}, the reader is referred to [Bottleneck_Shamai, Bottleneck_Polyanskiy, Courtade_CEO, Collaborative_IB]. Notice that one can study an essentially identical problem where the distortion constraint (28) is replaced by

limn1nI(Yn;ψ(ϕ(Xn)))R,andlim supn1nlogKn𝖱.\lim_{n\to\infty}\frac{1}{n}I(Y^{n};\psi(\phi(X^{n})))\geq R,\qquad\text{and}\qquad\limsup_{n\to\infty}\frac{1}{n}\log K_{n}\leq\mathsf{R}.

This problem is addressed in [IB_operational] for discrete alphabets 𝒳{\mathcal{X}} and 𝒴{\mathcal{Y}} and was recently extended in [IB_General] to general alphabets.
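The identity 𝔼[d(Y,Y^)]=H(Y|Y^) used to pass from (29) to (30) reflects a standard property of logarithmic loss: the expected distortion of any soft reconstruction that depends only on TT equals H(Y|T) plus an average Kullback-Leibler penalty, and it is therefore minimized by reporting the posterior PY|T. A small numerical sanity check of this decomposition is sketched below (the joint distribution, the test channel, and the alternative decoder Q are arbitrary values chosen only for illustration).

import numpy as np

# Joint distribution P_{XY} (rows indexed by x, columns by y) and a test
# channel P_{T|X} (rows indexed by x, columns by t); values are arbitrary.
P_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])
P_t_given_x = np.array([[0.80, 0.20],
                        [0.10, 0.90]])

P_yt = P_xy.T @ P_t_given_x            # P(y,t) = sum_x P(x,y) P(t|x), using Y -o- X -o- T
P_t = P_yt.sum(axis=0)
P_y_given_t = P_yt / P_t[None, :]      # posterior P_{Y|T=t}

def expected_logloss(decoder):
    # E[d(Y, Yhat)] = sum_{y,t} P(y,t) log(1 / decoder[y, t])
    return -(P_yt * np.log(decoder)).sum()

H_Y_given_T = -(P_yt * np.log(P_y_given_t)).sum()

# Reporting the posterior achieves exactly H(Y|T) ...
assert np.isclose(expected_logloss(P_y_given_t), H_Y_given_T)

# ... while any other soft decoder pays an extra average KL penalty.
Q = np.array([[0.5, 0.7],
              [0.5, 0.3]])             # some other (suboptimal) reconstruction; columns sum to 1
penalty = (P_t * (P_y_given_t * np.log(P_y_given_t / Q)).sum(axis=0)).sum()
assert np.isclose(expected_logloss(Q), H_Y_given_T + penalty)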

II-C2 Test Against Independence with Communication Constraint

As mentioned earlier, the connection between 𝖨𝖡\mathsf{IB} and noisy source coding described above was known and studied in [Bottleneck_Shamai, Bottleneck_Polyanskiy]. Here, we provide a new information-theoretic setting which gives yet another operational meaning to 𝖨𝖡\mathsf{IB}. Given nn i.i.d. samples (X1,Y1),,(Xn,Yn)(X_{1},Y_{1}),\dots,(X_{n},Y_{n}) from a joint distribution QQ, we wish to test whether the XiX_{i} are independent of the YiY_{i}, that is, whether QQ is a product distribution. This task is formulated by the following hypothesis test:

H0:Q=PXY,H1:Q=PXPY,\displaystyle\begin{aligned} H_{0}:&\leavevmode\nobreak\ \leavevmode\nobreak\ Q=P_{XY},\\ H_{1}:&\leavevmode\nobreak\ \leavevmode\nobreak\ Q=P_{X}P_{Y},\end{aligned} (31)

for a given joint distribution PXYP_{XY} with marginals PXP_{X} and PYP_{Y}. Ahlswede and Csiszár [Hypothesis_Testing_Ahslwede] investigated this problem under a communication constraint: while the YY observations (i.e., Y1,,YnY_{1},\dots,Y_{n}) are fully available, the XX observations must be compressed at rate RR; that is, instead of XnX^{n}, only ϕ(Xn)\phi(X^{n}) is available, where ϕ:𝒳n{1,,Kn}\phi:{\mathcal{X}}^{n}\to\{1,\dots,K_{n}\} satisfies

1nlogKnR.\frac{1}{n}\log K_{n}\leq R.

Under the constraint that the type I error probability does not exceed a fixed ε(0,1)\varepsilon\in(0,1), Ahlswede and Csiszár [Hypothesis_Testing_Ahslwede] derived the smallest possible type II error probability, defined as

βR(n,ε)=minϕ:𝒳n[K]1nlogKnRminA[Kn]×𝒴n{(Pϕ(Xn)×PYn)(A):Pϕ(Xn)×Yn(A)1ε}.\beta_{R}(n,\varepsilon)=\min_{\phi:{\mathcal{X}}^{n}\to[K]\atop\frac{1}{n}\log K_{n}\leq R}\min_{A\subset[K_{n}]\times{\mathcal{Y}}^{n}}\Big{\{}(P_{\phi(X^{n})}\times P_{Y^{n}})(A):\leavevmode\nobreak\ \leavevmode\nobreak\ P_{\phi(X^{n})\times Y^{n}}(A)\geq 1-\varepsilon\Big{\}}.

The following gives the asymptotic expression of βR(n,ε)\beta_{R}(n,\varepsilon) for every ε(0,1)\varepsilon\in(0,1). For the proof, refer to [Hypothesis_Testing_Ahslwede, Theorem 3].

Theorem 7 ([Hypothesis_Testing_Ahslwede]).

For every R0R\geq 0 and ε(0,1)\varepsilon\in(0,1), we have

limn1nlogβR(n,ε)=𝖨𝖡(R).\lim_{n\to\infty}-\frac{1}{n}\log\beta_{R}(n,\varepsilon)=\mathsf{IB}(R).

In light of this theorem, 𝖨𝖡(R)\mathsf{IB}(R) specifies the exponential rate at which the type II error probability of the hypothesis test (31) decays as the number of samples increases.

II-C3 Dependence Dilution

Inspired by the problems of information amplification [Cover_State_Amplification] and state masking [Merhav_state_masking], Asoodeh et al. [Asoode_submitted] proposed the dependence dilution setup as follows. Consider a source sequence XnX^{n} of nn i.i.d. copies of XPXX\sim P_{X}. Alice observes the source XnX^{n} and wishes to encode it via the encoder

fn:𝒳n{1,2,,2nR},f_{n}:{\mathcal{X}}^{n}\to\{1,2,\dots,2^{nR}\},

for some R>0R>0. The goal is to ensure that any user observing fn(Xn)f_{n}(X^{n}) can construct a fixed-size list of sequences in 𝒳n{\mathcal{X}}^{n} that contains likely candidates of the actual sequence XnX^{n}, while revealing negligible information about a correlated source YnY^{n}. To formulate this goal, consider the decoder

gn:{1,2,,2nR}2𝒳n,g_{n}:\{1,2,\dots,2^{nR}\}\to 2^{{\mathcal{X}}^{n}},

where 2𝒳n2^{{\mathcal{X}}^{n}} denotes the power set of 𝒳n{\mathcal{X}}^{n}. A dependence dilution triple (R,Γ,Δ)3+(R,\Gamma,\Delta)\in\mathbb{R}^{3}_{+} is said to be achievable if, for any δ>0\delta>0, there exists a pair of encoder and decoder (fn,gn)(f_{n},g_{n}) such that for sufficiently large nn

Pr(Xngn(J))<δ,\Pr\left(X^{n}\notin g_{n}(J)\right)<\delta, (32)

where gn(J)g_{n}(J) has fixed size |gn(J)|=2n(H(X)Γ)|g_{n}(J)|=2^{n(H(X)-\Gamma)} and J=fn(Xn)J=f_{n}(X^{n}), and simultaneously

1nI(Yn;J)Δ+δ.\frac{1}{n}I(Y^{n};J)\leq\Delta+\delta. (33)

Notice that without side information JJ, the decoder can only construct a list of size 2nH(X)2^{nH(X)} that contains XnX^{n} with probability close to one. However, after JJ is observed and the list gn(J)g_{n}(J) is formed, the decoder’s list size can be reduced to 2n(H(X)Γ)2^{n(H(X)-\Gamma)}, thus reducing the uncertainty about XnX^{n} by nΓ[0,nH(X)]n\Gamma\in[0,nH(X)]. This observation can be formalized to show (see [Cover_State_Amplification] for details) that the constraint (32) is equivalent to

1nI(Xn;J)Γδ,\frac{1}{n}I(X^{n};J)\geq\Gamma-\delta, (34)

which lower bounds the amount of information JJ carries about XnX^{n}. Building on this equivalent formulation, Asoodeh et al. [Asoode_submitted, Corollary 15] derived a necessary condition for achievable dependence dilution triples.

Theorem 8 ([Asoode_submitted]).

Any achievable dependence dilution triple (R,Γ,Δ)(R,\Gamma,\Delta) satisfies

{RΓΓI(X;T)ΔI(Y;T)I(X;T)+Γ,\begin{cases}R&\geq\Gamma\\ \Gamma&\leq I(X;T)\\ \Delta&\geq I(Y;T)-I(X;T)+\Gamma,\end{cases}

for some auxiliary random variable TT satisfying YXTY\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}X\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}T and taking |𝒯||𝒳|+1|{\mathcal{T}}|\leq|{\mathcal{X}}|+1 values.

According to this theorem, 𝖯𝖥(Γ)\mathsf{PF}(\Gamma) specifies the best privacy performance of the dependence dilution setup for the maximum amplification rate Γ\Gamma. While this informs the operational interpretation of 𝖯𝖥\mathsf{PF}, Theorem 8 only provides an outer bound for the set of achievable dependence dilution triples (R,Γ,Δ)(R,\Gamma,\Delta). It is, however, not clear whether 𝖯𝖥\mathsf{PF} characterizes the rate region of an information-theoretic setup.

The fact that 𝖨𝖡\mathsf{IB} fully characterizes the rate region of a source coding setup has an important consequence: the cardinality bound on the auxiliary random variable TT in 𝖨𝖡\mathsf{IB} can be improved to |𝒳||{\mathcal{X}}| instead of |𝒳|+1|{\mathcal{X}}|+1.

II-D Cardinality Bound

Recall that in the definition of 𝖨𝖡\mathsf{IB} in (4), no assumption was imposed on the auxiliary random variable TT. A straightforward application of the Carathéodory-Fenchel-Eggleston theorem (a strengthening of the original Carathéodory theorem when the underlying space is connected; see, e.g., [Witsenhausen_Convexity, Section III] or [csiszarbook, Lemma 15.4]) reveals that 𝖨𝖡\mathsf{IB} is attained for TT taking values in a set 𝒯{\mathcal{T}} with cardinality |𝒯||𝒳|+1|{\mathcal{T}}|\leq|{\mathcal{X}}|+1. Here, we improve this bound to |𝒯||𝒳||{\mathcal{T}}|\leq|{\mathcal{X}}|.

Theorem 9.

For any joint distribution PXYP_{XY} and R(0,H(X)]R\in(0,H(X)], information bottleneck 𝖨𝖡(R)\mathsf{IB}(R) is achieved by TT taking at most |𝒳||{\mathcal{X}}| values.

The proof of this theorem hinges on the operational characterization of 𝖨𝖡\mathsf{IB} as the lower boundary of the rate-distortion region of the noisy source coding problem discussed in Section II-C. Specifically, we first show that the extreme points of this region are achieved by TT taking |𝒳||{\mathcal{X}}| values. We then make use of a property of the noisy source coding problem (namely, time-sharing) to argue that all points of this region (including the boundary points) can be attained by such TT. It must be mentioned that this result was already claimed, without proof, by Harremoës and Tishby in [harremoes2007information].

In many practical scenarios, the feature XX has a large alphabet. Hence, the bound |𝒯||𝒳||{\mathcal{T}}|\leq|{\mathcal{X}}|, albeit optimal, may still leave the information bottleneck function computationally intractable over large alphabets. However, the label YY usually has a significantly smaller alphabet. While it is in general impossible to bound the cardinality of TT in terms of |𝒴||{\mathcal{Y}}|, one can approximate 𝖨𝖡\mathsf{IB} under the additional assumption that TT takes NN values. The following result, recently proved by Hirche and Winter [Hirche_IB_cardinality], is in this spirit.

Theorem 10 ([Hirche_IB_cardinality]).

For any (X,Y)PXY(X,Y)\sim P_{XY}, we have

𝖨𝖡(R,N)𝖨𝖡(R)𝖨𝖡(R,N)+δ(N),\mathsf{IB}(R,N)\leq\mathsf{IB}(R)\leq\mathsf{IB}(R,N)+\delta(N),

where δ(N)=4N1|𝒴|[log|𝒴|4+1|𝒴|logN]\delta(N)=4N^{-\frac{1}{|{\mathcal{Y}}|}}\left[\log\frac{|{\mathcal{Y}}|}{4}+\frac{1}{|{\mathcal{Y}}|}\log N\right] and 𝖨𝖡(R,N)\mathsf{IB}(R,N) denotes the information bottleneck functional (4) with the additional constraint that |𝒯|N|{\mathcal{T}}|\leq N.
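To get a feel for Theorem 10, the short sketch below tabulates δ(N) (natural logarithms are assumed here; the alphabet size |𝒴|=4 and the values of NN are arbitrary illustrative choices).

import math

def delta_gap(N, Y_card):
    # delta(N) = 4 N^(-1/|Y|) [ log(|Y|/4) + log(N)/|Y| ], from Theorem 10
    return 4.0 * N ** (-1.0 / Y_card) * (math.log(Y_card / 4.0) + math.log(N) / Y_card)

for k in (1, 2, 4, 8, 12, 16):
    N = 4 ** k
    print(N, round(delta_gap(N, Y_card=4), 3))
# The printed gaps first grow and then decay like N^(-1/|Y|) * log N,
# so IB(R, N) is provably close to IB(R) only for rather large N.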

Recall that, unlike 𝖯𝖥\mathsf{PF}, the graph of 𝖨𝖡\mathsf{IB} characterizes the rate region of a Shannon-theoretic coding problem (as illustrated in Section II-C), and hence any boundary point can be constructed via time-sharing of extreme points of the rate region. The lack of such an operational characterization for 𝖯𝖥\mathsf{PF} translates into a worse cardinality bound than that of 𝖨𝖡\mathsf{IB}. In fact, for 𝖯𝖥\mathsf{PF} the cardinality bound |𝒯||𝒳|+1|{\mathcal{T}}|\leq|{\mathcal{X}}|+1 cannot be improved in general. To demonstrate this, we numerically solve the optimization in 𝖯𝖥\mathsf{PF} assuming that |𝒯|=|𝒳||{\mathcal{T}}|=|{\mathcal{X}}| when both XX and YY are binary. As illustrated in Fig. 5, this optimization does not lead to a convex function and hence cannot be equal to 𝖯𝖥\mathsf{PF}.

Figure 5: The set {(I(X;T),I(Y;T))}\{(I(X;T),I(Y;T))\} with PX=𝖡𝖾𝗋𝗇𝗈𝗎𝗅𝗅𝗂(0.9)P_{X}={\mathsf{Bernoulli}}(0.9), PY|X=0=[0.9,0.1]P_{Y|X=0}=[0.9,0.1], PY|X=1=[0.85,0.15]P_{Y|X=1}=[0.85,0.15], and TT restricted to be binary. While the upper boundary of this set is concave, the lower boundary is not convex. This implies that, unlike 𝖨𝖡\mathsf{IB}, 𝖯𝖥(r)\mathsf{PF}(r) cannot be attained by binary variables TT.
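The computation behind Fig. 5 can be reproduced approximately by randomly sampling channels PT|X and recording the resulting pairs (I(X;T),I(Y;T)). The sketch below uses the distribution from the caption with binary TT; scatter-plotting the output exhibits the non-convex lower boundary, while repeating the sampling with |𝒯|=3 fills in the region whose lower boundary is 𝖯𝖥(r). The Dirichlet sampling scheme and the sample size are illustrative choices, not necessarily the procedure used to generate the figure.

import numpy as np

rng = np.random.default_rng(0)

P_x = np.array([0.1, 0.9])                     # P_X = Bernoulli(0.9), i.e., P(X = 1) = 0.9
P_y_given_x = np.array([[0.90, 0.10],          # P_{Y|X=0}
                        [0.85, 0.15]])         # P_{Y|X=1}

def mutual_info(P_joint):
    # mutual information (in nats) between the row and column variables of a joint pmf
    Pr = P_joint.sum(axis=1, keepdims=True)
    Pc = P_joint.sum(axis=0, keepdims=True)
    mask = P_joint > 0
    return (P_joint[mask] * np.log(P_joint[mask] / (Pr @ Pc)[mask])).sum()

points = []
for _ in range(10000):
    P_t_given_x = rng.dirichlet(np.ones(2), size=2)       # random channel with |T| = 2
    P_xt = P_x[:, None] * P_t_given_x                     # joint P(x, t)
    P_yt = (P_x[:, None] * P_y_given_x).T @ P_t_given_x   # P(y, t) via Y -o- X -o- T
    points.append((mutual_info(P_xt), mutual_info(P_yt)))
points = np.array(points)
# Scatter-plotting `points` reproduces the cloud of Fig. 5; its lower envelope is
# visibly non-convex.  Re-running the loop with rng.dirichlet(np.ones(3), size=2)
# (i.e., |T| = 3) fills in the full region, whose lower boundary is PF(r).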

II-E Deterministic Information Bottleneck

As mentioned earlier, 𝖨𝖡\mathsf{IB} formalizes an information-theoretic approach to clustering a high-dimensional feature XX into cluster labels TT that preserve as much information about the label YY as possible. The clustering label is assigned by the soft operator PT|XP_{T|X} that solves the 𝖨𝖡\mathsf{IB} formulation (4) according to the rule: X=xX=x is likely assigned label T=tT=t if D𝖪𝖫(PY|xPY|t)D_{\mathsf{KL}}(P_{Y|x}\|P_{Y|t}) is small, where PY|t=xPY|xPX|tP_{Y|t}=\sum_{x}P_{Y|x}P_{X|t}. That is, cluster labels are assigned based on the similarity of conditional distributions. Since in many practical scenarios a hard clustering operator is preferred, Strouse and Schwab [strouse2017dib] suggested the following variant of 𝖨𝖡\mathsf{IB}, termed the deterministic information bottleneck 𝖽𝖨𝖡\mathsf{dIB}:

𝖽𝖨𝖡(PXY,R)supf:𝒳𝒯,H(f(X))RI(Y;f(X)),\mathsf{dIB}(P_{XY},R)\coloneqq\sup_{\begin{subarray}{c}f:{\mathcal{X}}\to{\mathcal{T}},\\ H(f(X))\leq R\end{subarray}}I(Y;f(X)), (35)

where the maximization is taken over all deterministic functions ff whose range is a finite set 𝒯{\mathcal{T}}. Similarly, one can define

𝖽𝖯𝖥(PXY,r)inff:𝒳𝒯,H(f(X))rI(Y;f(X)).\mathsf{dPF}(P_{XY},r)\coloneqq\inf_{\begin{subarray}{c}f:{\mathcal{X}}\to{\mathcal{T}},\\ H(f(X))\geq r\end{subarray}}I(Y;f(X)). (36)
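To make the definitions (35) and (36) concrete, the sketch below brute-forces 𝖽𝖨𝖡 for a toy distribution by enumerating all deterministic maps f:𝒳→[k], keeping those with H(f(X))≤R, and maximizing I(Y;f(X)). This is feasible only for very small alphabets and is meant purely as an illustration of the formulation; the joint distribution, the rate, and the range size k are arbitrary choices, with entropies in nats.

import itertools
import numpy as np

P_xy = np.array([[0.20, 0.05],
                 [0.10, 0.25],
                 [0.05, 0.15],
                 [0.15, 0.05]])          # toy joint pmf with |X| = 4, |Y| = 2
P_x = P_xy.sum(axis=1)

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def mutual_info(P_joint):
    return entropy(P_joint.sum(axis=1)) + entropy(P_joint.sum(axis=0)) - entropy(P_joint.ravel())

def dIB_bruteforce(R, k):
    best = 0.0
    # enumerate every deterministic map f: {0,...,|X|-1} -> {0,...,k-1}
    for f in itertools.product(range(k), repeat=len(P_x)):
        P_ty = np.zeros((k, P_xy.shape[1]))
        for x, t in enumerate(f):
            P_ty[t] += P_xy[x]               # joint distribution of (T, Y) with T = f(X)
        if entropy(P_ty.sum(axis=1)) <= R:   # constraint H(f(X)) <= R
            best = max(best, mutual_info(P_ty))
    return best

# Any feasible deterministic f is also feasible for IB, so this value lower-bounds IB(R).
print(dIB_bruteforce(R=np.log(3), k=4))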

One way to ensure that H(f(X))RH(f(X))\leq R for a deterministic function ff is to restrict the cardinality of the range of ff: if f:𝒳[eR]f:{\mathcal{X}}\to[e^{R}], then H(f(X))H(f(X)) is necessarily no larger than RR. Using this insight, we derive a lower bound for 𝖽𝖨𝖡(PXY,R)\mathsf{dIB}(P_{XY},R) in the following lemma.

Lemma 5.

For any given PXYP_{XY}, we have

𝖽𝖨𝖡(PXY,R)eR1|𝒳|I(X;Y),\mathsf{dIB}(P_{XY},R)\geq\frac{e^{R}-1}{|{\mathcal{X}}|}I(X;Y),

and

𝖽𝖯𝖥(PXY,r)er1|𝒳|I(X;Y)+Pr(Xer)log1Pr(Xer).\mathsf{dPF}(P_{XY},r)\leq\frac{e^{r}-1}{|{\mathcal{X}}|}I(X;Y)+\Pr(X\geq e^{r})\log\frac{1}{\Pr(X\geq e^{r})}.

Note that both RR and rr are smaller than H(X)H(X) and thus the multiplicative factors of I(X;Y)I(X;Y) in the lemma are smaller than one. In light of this lemma, we can obtain

eR1|𝒳|I(X;Y)𝖨𝖡(R)I(X;Y),\frac{e^{R}-1}{|{\mathcal{X}}|}I(X;Y)\leq\mathsf{IB}(R)\leq I(X;Y),

and

𝖯𝖥(r)er1|𝒳|I(X;Y)+Pr(Xer)log1Pr(Xer).\mathsf{PF}(r)\leq\frac{e^{r}-1}{|{\mathcal{X}}|}I(X;Y)+\Pr(X\geq e^{r})\log\frac{1}{\Pr(X\geq e^{r})}.

In most practical setups, |𝒳||{\mathcal{X}}| may be very large, making the above lower bound for 𝖨𝖡\mathsf{IB} vacuous. In the following lemma, we partially address this issue by deriving a bound independent of |𝒳||{\mathcal{X}}| when YY is binary.

Lemma 6.

Let PXYP_{XY} be a joint distribution of arbitrary XX and binary Y𝖡𝖾𝗋𝗇𝗈𝗎𝗅𝗅𝗂(q)Y\sim{\mathsf{Bernoulli}}(q) for some q(0,1)q\in(0,1). Then, for any Rlog5R\geq\log 5 we have

𝖽𝖨𝖡(PXY,R)I(X;Y)2αh𝖻(I(X;Y)2α(eR4)),\mathsf{dIB}(P_{XY},R)\geq I(X;Y)-2\alpha h_{\mathsf{b}}\Big{(}\frac{I(X;Y)}{2\alpha(e^{R}-4)}\Big{)},

where α=max{log1q,log11q}\alpha=\max\{\log\frac{1}{q},\log\frac{1}{1-q}\}.
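The bounds of Lemmas 5 and 6 can be compared directly. The sketch below evaluates both dIB lower bounds for a synthetic family in which X is uniform on an alphabet of size n and Y is a noisy indicator bit, so that q = 1/2 and I(X;Y) does not depend on n. Natural logarithms are used, and the family, the noise level 0.1, and the rate R = 3 are arbitrary illustrations chosen to satisfy R ≥ log 5 and R ≤ H(X).

import math

def hb(p):
    # binary entropy in nats
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

# Toy family: X uniform on {0,...,n-1}, Y = 1{X < n/2} flipped with probability 0.1,
# so q = 1/2, alpha = log 2, and I(X;Y) = log 2 - hb(0.1) for every even n.
I_xy = math.log(2.0) - hb(0.1)
alpha = math.log(2.0)
R = 3.0                                   # satisfies R >= log 5 and R <= H(X) = log n below

for n in (32, 128, 1024):
    lemma5 = (math.exp(R) - 1.0) / n * I_xy
    lemma6 = I_xy - 2.0 * alpha * hb(I_xy / (2.0 * alpha * (math.exp(R) - 4.0)))
    print(n, round(lemma5, 4), round(lemma6, 4))
# Lemma 5's bound vanishes as n grows, while Lemma 6's bound does not depend on |X|.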

III Family of Bottleneck Problems

In this section, we introduce a family of bottleneck problems by extending 𝖨𝖡\mathsf{IB} and 𝖯𝖥\mathsf{PF} to a large family of statistical measures. Similar to 𝖨𝖡\mathsf{IB} and 𝖯𝖥\mathsf{PF}, these bottleneck problems are defined in terms of boundaries of a two-dimensional convex set induced by a joint distribution PXYP_{XY}. Recall that R𝖨𝖡(PXY,R)R\mapsto\mathsf{IB}(P_{XY},R) and r𝖯𝖥(PXY,r)r\mapsto\mathsf{PF}(P_{XY},r) are the upper and lower boundaries, respectively, of the set {\mathcal{M}} defined in (6), restated here for convenience:

={(I(X;T),I(Y;T)):YXT,(X,Y)PXY}.{\mathcal{M}}=\big{\{}(I(X;T),I(Y;T)):Y\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}X\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}T,(X,Y)\sim P_{XY}\big{\}}. (37)

Since PXYP_{XY} is given, H(X)H(X) and H(Y)H(Y) are fixed. Thus, in characterizing {\mathcal{M}} it is sufficient to consider only H(X|T)H(X|T) and H(Y|T)H(Y|T). To generalize 𝖨𝖡\mathsf{IB} and 𝖯𝖥\mathsf{PF}, we must therefore generalize H(X|T)H(X|T) and H(Y|T)H(Y|T).

Given a joint distribution PXYP_{XY} and two non-negative real-valued functions Φ:𝒫(𝒳)+\Phi:{\mathcal{P}}({\mathcal{X}})\to\mathbb{R}^{+} and Ψ:𝒫(𝒴)+\Psi:{\mathcal{P}}({\mathcal{Y}})\to\mathbb{R}^{+}, we define

Φ(X|T)𝔼[Φ(PX|T)]=t𝒯PT(t)Φ(PX|T=t),\Phi(X|T)\coloneqq{\mathbb{E}}\left[\Phi(P_{X|T})\right]=\sum_{t\in{\mathcal{T}}}P_{T}(t)\Phi(P_{X|T=t}), (38)

and

Ψ(Y|T)𝔼[Ψ(PY|T)]=t𝒯PT(t)Ψ(PY|T=t).\Psi(Y|T)\coloneqq{\mathbb{E}}\left[\Psi(P_{Y|T})\right]=\sum_{t\in{\mathcal{T}}}P_{T}(t)\Psi(P_{Y|T=t}). (39)

When XPXX\sim P_{X} and YPYY\sim P_{Y}, we interchangeably write Φ(X)\Phi(X) for Φ(PX)\Phi(P_{X}) and Ψ(Y)\Psi(Y) for Ψ(PY)\Psi(P_{Y}).

These definitions provide natural generalizations of Shannon’s entropy and mutual information. Moreover, as we discuss later in the sections on guessing probability and Arimoto’s mutual information, they can also be specialized to represent a large family of popular information-theoretic and statistical measures. Examples include information- and estimation-theoretic quantities such as Arimoto’s conditional entropy of order α\alpha for Φ(QX)=||QX||αΦ(Q_{X})=||Q_{X}||_{\alpha}, the probability of correctly guessing for Φ(QX)=||QX||Φ(Q_{X})=||Q_{X}||_{\infty}, maximal correlation in the binary case, and ff-information for Φ(QX)\Phi(Q_{X}) given by an ff-divergence. We can thus generate a family of bottleneck problems by using different instantiations of Φ(X|T)\Phi(X|T) and Ψ(Y|T)\Psi(Y|T) in place of mutual information in 𝖨𝖡\mathsf{IB} and 𝖯𝖥\mathsf{PF}. As we argue later, these problems better capture the essence of "informativeness" and "privacy", thereby providing analytical and interpretable guarantees similar in spirit to 𝖨𝖡\mathsf{IB} and 𝖯𝖥\mathsf{PF}.
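As a concrete instance of (38), the sketch below evaluates Φ(X|T) for Φ(Q_X)=||Q_X||_∞, which is the probability of correctly guessing X in one attempt after observing T, and, for comparison, for Φ equal to Shannon entropy, which recovers H(X|T). The marginal, the channel PT|X, and the alphabet sizes are arbitrary illustrative choices, with entropy in nats.

import numpy as np

P_x = np.array([0.5, 0.3, 0.2])
P_t_given_x = np.array([[0.7, 0.3],
                        [0.2, 0.8],
                        [0.5, 0.5]])     # an arbitrary channel P_{T|X}

P_xt = P_x[:, None] * P_t_given_x        # joint P(x, t)
P_t = P_xt.sum(axis=0)
P_x_given_t = P_xt / P_t[None, :]        # columns are the posteriors P_{X|T=t}

def cond_measure(Phi):
    # Eq. (38): Phi(X|T) = sum_t P_T(t) * Phi(P_{X|T=t})
    return sum(P_t[t] * Phi(P_x_given_t[:, t]) for t in range(len(P_t)))

guess_prob = cond_measure(lambda q: q.max())                            # Phi = sup-norm
cond_entropy = cond_measure(lambda q: float(-(q * np.log(q)).sum()))    # Phi = Shannon entropy

print(guess_prob)    # probability of correctly guessing X having observed T
print(cond_entropy)  # equals H(X|T)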

Computing these bottleneck problems in general boils down to the following optimization problems

𝖴Φ,Ψ(ζ)supPT|X:YXTΦ(X|T)ζΨ(Y|T),{\mathsf{U}}_{\Phi,\Psi}(\zeta)\coloneqq\sup_{\begin{subarray}{c}P_{T|X}:Y\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}X\mathrel{\multimap}\joinrel\mathrel{-}\mspace{-9.0mu}\joinrel\mathrel{-}T\\ \Phi(X|T)\leq\zeta\end{subarray}}\Psi(Y|T), (40)

and