
A Few Interactions Improve Distributed Nonparametric Estimation, Optimally

Jingbo Liu. Jingbo Liu is with the Department of Statistics and the Department of Electrical and Computer Engineering, University of Illinois, Urbana-Champaign, IL, 61820, US. Email: [email protected]. Received 16-Feb-2022; revised 27-Nov-2022. Associate editor: Himanshu Tyagi. This paper was presented in part at the 2022 IEEE International Symposium on Information Theory (ISIT) [29].
Abstract

Consider the problem of nonparametric estimation of an unknown $\beta$-Hölder smooth density $p_{XY}$ at a given point, where $X$ and $Y$ are both $d$ dimensional. An infinite sequence of i.i.d. samples $(X_{i},Y_{i})$ is generated according to this distribution, and two terminals observe $(X_{i})$ and $(Y_{i})$, respectively. They are allowed to exchange $k$ bits, either one-way or interactively, in order for Bob to estimate the unknown density. We show that the minimax mean square risk is of order $\left(\frac{k}{\log k}\right)^{-\frac{2\beta}{d+2\beta}}$ for one-way protocols and $k^{-\frac{2\beta}{d+2\beta}}$ for interactive protocols. The logarithmic improvement is absent in the parametric counterparts, and can therefore be regarded as a consequence of the nonparametric nature of the problem. Moreover, a few rounds of interaction suffice to achieve the interactive minimax rate: the number of rounds can grow as slowly as the super-logarithm (i.e., inverse tetration) of $k$. The proof of the upper bound is based on a novel multi-round scheme for estimating the joint distribution of a pair of biased Bernoulli variables, and the lower bound is built on a sharp estimate of a symmetric strong data processing constant for biased Bernoulli variables.

Index Terms:
Density estimation, Communication complexity, Nonparametric statistics, Learning with system constraints, Strong data processing constant

I Introduction

The communication complexity problem was introduced in the seminal paper of Yao [50] (see also [26] for a survey), where two terminals (which we call Alice and Bob) compute a given Boolean function of their local inputs (${\bf X}=(X_{i})_{i=1}^{n}$ and ${\bf Y}=(Y_{i})_{i=1}^{n}$) by means of exchanging messages. The famous log-rank conjecture provides an estimate of the communication complexity of a general Boolean function, which is still open to date. Meanwhile, communication complexity of certain specific functions can be better understood. For example, the Gap-Hamming problem [24][12] concerns testing $f({\bf X},{\bf Y})>\frac{1}{n}$ against $f({\bf X},{\bf Y})<-\frac{1}{n}$, where $f({\bf X},{\bf Y}):=\frac{1}{n}\sum_{i=1}^{n}X_{i}Y_{i}$ denotes the sample correlation and $X_{i},Y_{i}\in\{+1,-1\}$. It was shown in [12] with a geometric argument that the communication complexity (for worst-case deterministic ${\bf X},{\bf Y}$) is $\Theta(n)$; therefore a one-way protocol where Alice simply sends ${\bf X}$ cannot be improved (up to a multiplicative constant) by an interactive protocol.

Gap-Hamming is closely related to the problem of estimating the joint distribution of a pair of binary or Gaussian random variables (using $n$ i.i.d. samples). Indeed, for $n$ large we may assume that Alice (resp. Bob) can estimate the marginal distributions of $X$ (resp. $Y$) very well, so that the joint distribution is parameterized by only one scalar, which is the correlation. An information-theoretic proof of Gap-Hamming was previously provided in [19], building on a converse for correlation estimation for the binary symmetric distribution, and pinned down the exact prefactor in the risk-communication tradeoff. In particular, the result of [19] implies that the naive algorithm where Alice simply sends ${\bf X}$ can be improved by a constant factor in the estimation risk by a more sophisticated scheme using additional samples. For the closely related problem of correlation (distribution) testing, [38] and [48] provided asymptotically tight bounds on the communication complexity under the one-way and interactive protocols when the null hypothesis is the independent distribution (zero correlation), which also implies that the error exponent can be improved by an algorithm using additional samples. The technique of [48] is based on the tensorization of internal and external information ((20) ahead), whereas the bound of [38] uses hypercontractivity. More recently, [20] derived bounds for testing against dependent distributions using optimal transport inequalities.

In this paper, we take the natural step of introducing nonparametric (NP) statistics to Alice and Bob, whereby two parties estimate a nonparametric density by means of sending messages interactively. It will be seen that this problem is closely related to a “sparse” version of the aforementioned Gap-Hamming problem, where interaction does help, in contrast to the usual Gap-Hamming problem.

For concreteness, consider the problem of nonparametric estimation of an unknown $\beta$-Hölder smooth density $p_{XY}$ at a given point $(x_{0},y_{0})$. For simplicity we assume the symmetric case where $X$ and $Y$ are both $d$ dimensional. An infinite sequence of i.i.d. samples $(X_{i},Y_{i})$ is generated according to $p_{XY}$, and Alice and Bob observe $(X_{i})$ and $(Y_{i})$, respectively. After they exchange $k$ bits (either one-way or interactively), Bob estimates the unknown density at the given point. We successfully characterize the minimax rate in terms of the communication complexity $k$: it is of order $\left(\frac{k}{\log k}\right)^{-\frac{2\beta}{d+2\beta}}$ for one-way protocols and $k^{-\frac{2\beta}{d+2\beta}}$ for interactive protocols.

Notably, allowing interaction strictly improves the estimation risk. Previously, separations between one-way and interactive protocols were known, but in very different contexts. In [32, Corollary 1] (see also [31]), a separation was found in the rate region of common randomness generation from biased binary distributions, using certain convexity arguments, but this only implies a difference in the leading constant, rather than in the asymptotic scaling. On the other hand, the example distribution in [42] is based on the pointer-chasing construction of [35], which appears to be a highly artificial distribution designed to entail a separation between the one-way and interactive protocols. Another example where interaction improves zero-error source coding with side information, based on a “bit location” algorithm, was described in [36], where it was shown that two-way communication complexity differs from interactive communication complexity only by constant factors. In contrast, the logarithmic separation in the present paper arises from the nonparametric nature of the problem: if we consider the problem of correlation estimation for Bernoulli pairs with a fixed bias (a parametric problem), the risk will be of order $k^{-\frac{1}{2}}$, and there will be no separation between one-way and interactive protocols (which is indeed the case in [19]). In contrast, nonparametric estimation is analogous to Bernoulli correlation estimation where the bias changes with $k$ (since the optimal bandwidth adapts to $k$), which gives rise to the separation.

For the risk upper bound, in the one-way setting it is efficient for Alice to just encode the set of $i$'s such that $X_{i}$ falls within a neighborhood (whose size is the optimal bandwidth for the given $k$) of the given point $x_{0}$. To achieve the optimal $k^{-\frac{2\beta}{d+2\beta}}$ rate for interactive protocols, we provide a novel scheme that uses $r>1$ rounds of interaction, where $r=r(k)$ grows as slowly as the super-logarithm (i.e., the inverse of tetration) of $k$. With the sequence $r(k)$ we use in Section V-C (and supposing that $\beta=d=1$), while $r=4$ rounds of interaction are barely enough for $k$ equal to the number of letters in a short sentence, $r=8$ is more than sufficient for $k$ equal to the number of all elementary particles in the entire observable universe. Thus from a practical perspective, $r(k)$ is effectively a constant, although it remains an interesting theoretical question whether $r(k)$ really diverges (Conjecture 1).

For the lower bound, the proof is based on the symmetric data processing constant introduced in [32]. Previously, the data processing constant $s^{*}_{r}$ has been connected to two-party estimation and hypothesis testing in [19]; the idea was canonized as the following statement: “Information for hypothesis testing locally” is upper bounded by $s^{*}_{r}$ times “Information communicated mutually”. However, $s^{*}_{r}$ is not easy to compute, and previous bounds on $s^{*}_{r}$ are also not tight enough for our purpose. Instead, we first use an idea of simulation of continuous variables to reduce the problem to estimation of Bernoulli distributions, for which $s^{*}_{r}$ is easier to analyze. Then we use some new arguments to bound $s^{*}_{\infty}$.

Let us emphasize that this paper concerns density estimation at a given point, rather than estimation of the global density function. For the latter problem, it is optimal for Alice to just quantize the samples and send them to Bob, which we show in the companion paper [28]. The mean square error (in $\ell_{2}$ norm) of estimating the global density function scales differently from the case of pointwise density estimation, since the messages cannot be tailored to the given point.

Related work. Besides function computation, distribution estimation and testing, other problems which have been studied in the communication complexity or privacy settings include lossy source coding [25] and common randomness or secret key generation [47][30][32][43]. The key technical tool for interactive two-way communication models, namely the tensorization of internal and external information ((20) ahead), appeared in [25] for lossy compression, [9][33] for function computation, [47][32] for common randomness generation, and [48][19] for parameter estimation.

For one-way communication models, the main tool is a tensorization property related to the strong data processing constant (see (10) ahead), which was first used in [4] in the study of the error exponents in communication constrained hypothesis testing. The hypercontractivity method for single-shot bounds in one-way models was used in [30][27] for common randomness generation and [38] for testing.

In statistics, communication-constrained estimation has received considerable attention recently, starting from [52], which considered a model where distributed samples are compressed and sent to a central estimator. Further works on this communication model include settings of Gaussian location estimation [10][11], parametric estimation [22], nonparametric regression [53], the Gaussian noise model [54], statistical inference [2], and nonparametric estimation [21] (with a bug fixed in [7]) [1]. Related problems solved using similar techniques include differential privacy [16] and data summarization [37][46][45]. Communication-efficient construction of test statistics for distributed testing using the divide-and-conquer algorithm is studied in [8]. Generally speaking, these works on statistical minimax rates concern the so-called horizontal partitioning of data sets, where data sets share the same feature space but differ in samples [49][18]. In contrast, vertical distributed or federated learning, where data sets differ in features, has been used by corporations such as those in finance and medical care [49][18]. It is worth mentioning that such a horizontal partitioning model was also introduced in Yao’s paper [50] in the context of function computation under the name “simultaneous message model”, where different parties send messages to a referee instead of to each other. The direct sum property (similar to the tensorization property of internal and external information) of the simultaneous message model was discussed in [13].

Organization of the paper. We review the background on nonparametric estimation, data processing constants and testing independence in Section II. The formulation of the two-party nonparametric estimation problem and the summary of main results are given in Section III. Section IV examines the problem of estimating a parameter in a pair of biased Bernoulli distributions, which will be used as a building block in our nonparametric estimation algorithm. Section V proves some bounds on information exchanges, which will be the key auxiliary results for the proof of the upper bound for Bernoulli estimation in Section VI, and for nonparametric estimation in Section VII. Finally, lower bounds are proved in Section VIII in the one-way case and in Section IX in the interactive case.

II Preliminaries

II-A Notation

We use capital letters for probability measures and lower-case letters for the density functions. We use the abbreviations $U_{i}^{j}:=(U_{i},\dots,U_{j})$ and $U^{j}:=U_{1}^{j}$. We use boldface letters to denote vectors, for example ${\bf U}_{i}=(U_{i}(l))_{l=1}^{n}$. Unless otherwise specified, the base of logarithms can be arbitrary but remains consistent throughout equations. The precise meaning of the Landau notations, such as $O(\cdot)$, will be explained in each section or in the proofs of specific theorems. We use $\sum_{1\leq i\leq r}^{\rm odd}$ to denote summing over $i\in\{1,\dots,r\}\setminus 2\mathbb{Z}$. For the vector representation of a binary probability distribution, we use the convention that $P_{U}=[P_{U}(0),P_{U}(1)]$. For the matrix representation of a joint distribution of a pair of binary random variables, we use the convention that $P_{XY}=\begin{bmatrix}P_{XY}(0,0)&P_{XY}(0,1)\\ P_{XY}(1,0)&P_{XY}(1,1)\end{bmatrix}$. For $x\in[0,1]$, we use the shorthand $\bar{x}:=1-x$.

II-B Nonparametric Estimation

Let us recall the basics about the problem of estimating a smooth density; more details may be found in [44, 41]. Let $d\geq 1$ be an integer, and let $s=(s_{1},\dots,s_{d})\in\{0,1,2,\dots\}^{d}$ be a multi-index. For $x=(x_{1},\dots,x_{d})\in\mathbb{R}^{d}$, let $D^{s}$ denote the differential operator

$D^{s}=\frac{\partial^{s_{1}+\dots+s_{d}}}{\partial x_{1}^{s_{1}}\cdots\partial x_{d}^{s_{d}}}.$ (1)

Given $\beta\in(0,\infty)$, let $\lfloor\beta\rfloor$ be the maximum integer strictly smaller than $\beta$ [44] (note the difference with the usual conventions). Given a function $f$ whose domain includes a set $\mathcal{A}\subseteq\mathbb{R}^{d}$, define $\|f\|_{\mathcal{A},\beta}$ as the minimum $L\geq 0$ such that

$|D^{s}f(x_{1})-D^{s}f(x_{2})|\leq L\|x_{1}-x_{2}\|_{2}^{\beta-\lfloor\beta\rfloor},\quad\forall x_{1},x_{2}\in\mathcal{A},$ (2)

for all multi-indices $s$ such that $s_{1}+\dots+s_{d}=\lfloor\beta\rfloor$. For example, $\beta=1$ defines a Lipschitz function, and an integer $\beta$ defines a function with bounded $\beta$-th derivative.

Given $L>0$, let $\mathcal{P}(\beta,L)$ be the class of probability density functions $p$ satisfying $\|p\|_{\mathbb{R}^{d},\beta}\leq L$. Let $x_{0}\in\mathbb{R}^{d}$ be arbitrary. The following result on the minimax estimation error is well-known:

$\inf_{T_{n}}\sup_{p\in\mathcal{P}(\beta,L)}\mathbb{E}[|T_{n}-p(x_{0})|^{2}]=\Theta(n^{-\frac{2\beta}{d+2\beta}})$ (3)

where the infimum is over all estimators $T_{n}$ of $p(x_{0})$, i.e., measurable maps from i.i.d. samples $X_{1},\dots,X_{n}\sim p$ to $\mathbb{R}$. The $\Theta(\cdot)$ in (3) may hide constants independent of $n$.

We say $K\colon\mathbb{R}^{d}\to\mathbb{R}$ is a kernel of order $l$ ($l\in\{1,2,\dots\}$) if $\int K=1$ and all derivatives of the Fourier transform of $K$ up to the $l$-th vanish at 0 [44, Definition 1.3]. Therefore the rectangular kernel, which is the indicator of a set, is of order 1. A kernel estimator has the form

$T_{n}=\frac{1}{nh^{d}}\sum_{l=1}^{n}K\left(\frac{X_{l}-x_{0}}{h}\right)$ (4)

where $h\in(0,\infty)$ is called the bandwidth. If $K$ is a kernel of order $l=\lfloor\beta\rfloor$, then the kernel estimator (4) with an appropriate $h$ achieves the bound in (3) [44, Chapter 1]. In particular, the rectangular kernel is minimax optimal for $\beta\in(0,2]$.
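To make (4) concrete, the following is a minimal sketch (ours, not from the paper) of the rectangular-kernel estimator at a point, with the rate-optimal bandwidth $h\sim n^{-\frac{1}{d+2\beta}}$; the Gaussian example density and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def rectangular_kernel_estimate(samples, x0, h):
    """Kernel estimator (4) with the rectangular kernel K(u) = 2^{-d} * 1{ ||u||_inf <= 1 }."""
    d = samples.shape[1]
    inside = np.all(np.abs(samples - x0) <= h, axis=1)
    return inside.sum() / (len(samples) * (2.0 * h) ** d)

# Illustrative example: d = 1, X ~ N(0, 1), estimate the density at x0 = 0 with the
# rate-optimal bandwidth h ~ n^{-1/(d + 2*beta)} (here beta = 2 is assumed).
n, beta, d = 100_000, 2.0, 1
x0 = np.zeros(d)
h = n ** (-1.0 / (d + 2 * beta))
samples = rng.standard_normal((n, d))
print(rectangular_kernel_estimate(samples, x0, h))  # should be close to 1/sqrt(2*pi) ~ 0.399
```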

If $K$ is compactly supported, then only local smoothness is needed, and a density lower bound does not change the rate: we have

$\inf_{T_{n}}\sup_{p\in\mathcal{P}_{\mathcal{S}}(\beta,L,A)}\mathbb{E}[|T_{n}-p(x_{0})|^{2}]=\Theta(n^{-\frac{2\beta}{d+2\beta}})$ (5)

where $\mathcal{S}$ is any compact neighborhood of $x_{0}$, $A\in[0,\frac{1}{\operatorname{vol}(\mathcal{S})})$ is arbitrary (with $\operatorname{vol}(\mathcal{S})$ denoting the volume of $\mathcal{S}$), and $\mathcal{P}_{\mathcal{S}}(\beta,L,A)$ denotes the non-empty set of probability density functions $p$ satisfying $\|p\|_{\mathcal{S},\beta}\leq L$ and $\inf_{x\in\mathcal{S}}p(x)\geq A$.

II-C Strong and Symmetric Data Processing Constants

The strong data processing constant has proved useful in many distributed estimation problems [10, 4, 16, 52]. In particular, it is strongly connected to two-party hypothesis testing under the one-way protocol. In contrast, the symmetric data processing constant [32] can be viewed as a natural extension to interactive protocols. This section recalls their definitions and auxiliary results, which will mainly be used in the proofs of lower bounds; however, the intuitions are useful for the upper bounds as well.

Given two probability measures PP, QQ on the same measurable space, define the KL divergence

$D(P\|Q):=\int\log\left(\frac{dP}{dQ}\right)dP.$ (6)

Define the $\chi^{2}$-divergence

$D_{\chi^{2}}(P\|Q):=\int\left(\frac{dP}{dQ}-1\right)^{2}dQ.$ (7)

Let $X,Y$ be two random variables with joint distribution $P_{XY}$. Define the mutual information

$I(X;Y):=D(P_{XY}\|P_{X}\times P_{Y}).$ (8)
Definition 1.

Let $P_{XY}$ be an arbitrary distribution on $\mathcal{X}\times\mathcal{Y}$. Define the strong data processing constant

$s^{*}(X;Y):=\sup_{P_{U|X}}\frac{I(U;Y)}{I(U;X)}$ (9)

where $P_{U|X}$ is a conditional distribution (with $\mathcal{U}$ being an arbitrary set), and the mutual informations are computed under the joint distribution $P_{U|X}P_{XY}$.

Clearly, the value of $s^{*}(X;Y)$ does not depend on the choice of the base of the logarithm. A basic yet useful property of the strong data processing constant is tensorization: if $({\bf X},{\bf Y})\sim P_{XY}^{\otimes n}$ then

$s^{*}({\bf X};{\bf Y})=s^{*}(X;Y).$ (10)

Now if $({\bf X},{\bf Y})$ are the samples observed by Alice and Bob, and $\Pi_{1}$ denotes the message sent to Bob, then $I(\Pi_{1};{\bf X})\leq k$ implies that

$D(P_{\Pi_{1}{\bf Y}}\|P_{\Pi_{1}}P_{\bf Y})\leq s^{*}(X;Y)k.$ (11)

The left side is the KL divergence between the distribution under the hypothesis that $(X,Y)$ follows some joint distribution, and the distribution under the hypothesis that $X$ and $Y$ are independent. Thus the error probabilities in testing against independence with one-way protocols can be lower bounded. This simple argument dates back at least to [4, 3].

A similar argument can be extended to testing independence under interactive protocols [48]. The fundamental fact enabling such extensions is the tensorization of certain information-theoretic quantities, which appeared in various contexts [25, 9, 32].

Definition 2.

Let $(X,Y)\sim P_{XY}$. For given $r<\infty$, define $s_{r}^{*}(X;Y)$ as the supremum of $R/S$ such that there exist random variables $U_{1},\dots,U_{r}$ satisfying

$R\leq\sum_{1\leq i\leq r}^{\rm odd}I(U_{i};Y|U^{i-1})+\sum_{1\leq i\leq r}^{\rm even}I(U_{i};X|U^{i-1});$ (12)
$S\geq\sum_{1\leq i\leq r}^{\rm odd}I(U_{i};X|U^{i-1})+\sum_{1\leq i\leq r}^{\rm even}I(U_{i};Y|U^{i-1}),$ (13)

and

$U_{i}-(X,U^{i-1})-Y,\quad i\in\{1,\dots,r\}\setminus 2\mathbb{Z}$ (14)
$U_{i}-(Y,U^{i-1})-X,\quad i\in\{1,\dots,r\}\cap 2\mathbb{Z}$ (15)

are Markov chains. We call $s_{\infty}^{*}(X;Y)$ the symmetric data processing constant.

Let us remark that, using the Markov chains, the right side of (12) satisfies

$\sum_{1\leq i\leq r}^{\rm odd}I(U_{i};Y|U^{i-1})+\sum_{1\leq i\leq r}^{\rm even}I(U_{i};X|U^{i-1})$
$=I(X;Y)-I(X;Y|U^{r})$ (16)
$=I(U^{r};XY)-[I(U^{r};X|Y)+I(U^{r};Y|X)]$ (17)

whereas the right side of (13) satisfies

$\sum_{1\leq i\leq r}^{\rm odd}I(U_{i};X|U^{i-1})+\sum_{1\leq i\leq r}^{\rm even}I(U_{i};Y|U^{i-1})=I(U^{r};XY).$ (18)

In the computer science literature [9], $I(U^{r};XY)$ is called the external information, whereas $I(U^{r};X|Y)+I(U^{r};Y|X)$ is called the internal information.

The symmetric strong data processing constant is symmetric in the sense that $s_{\infty}^{*}(X;Y)=s_{\infty}^{*}(Y;X)$, since $r=\infty$ in the definition. On the other hand, $s_{1}^{*}(X;Y)$ coincides with the strong data processing constant, which is generally not symmetric. Furthermore, a tensorization property holds for the internal and external information: denote by $\mathcal{R}(X;Y)$ the set of all $(R,S)$ satisfying (12) and (13) for some $U_{1},\dots,U_{r}$. Let $({\bf X},{\bf Y})\sim P_{XY}^{\otimes n}$. Then

$\mathcal{R}({\bf X};{\bf Y})=n\mathcal{R}(X;Y).$ (19)

In particular, taking the slope of the boundary at the origin yields

$s_{\infty}^{*}({\bf X};{\bf Y})=s_{\infty}^{*}(X;Y).$ (20)

A useful and general upper bound on $s_{\infty}^{*}$ in terms of an SVD was provided in [32, Theorem 4], which implies that $s_{\infty}^{*}=s_{1}^{*}$ when $X$ and $Y$ are unbiased Bernoulli. However, that bound is not tight enough for the nonparametric estimation problem we consider, and in fact we adopt a new approach in Section IX for the biased Bernoulli distribution. Let us remark that $s_{\infty}^{*}=s_{1}^{*}$ holds also for Gaussian $(X,Y)$, which follows by combining the result on the unbiased Bernoulli distribution and a central limit theorem argument [32] (see also [19]). Moreover, it was conjectured in [32] that the set of possible $(R,S)$ satisfying (12)-(13) does not depend on $r$ when $X$ and $Y$ are unbiased Bernoulli.

II-D Testing Against Independence

Consider the following setting: $P_{XY}$ is an arbitrary distribution on $\mathcal{X}\times\mathcal{Y}$; $P_{\bf XY}:=P_{XY}^{\otimes n}$; $\Pi=(\Pi_{0},\dots,\Pi_{r})$ is a sequence of random variables, with $P_{\Pi|{\bf XY}}$ being given and satisfying $P_{\Pi_{0}|{\bf XY}}=P_{\Pi_{0}}$, $P_{\Pi_{i}|{\bf XY}\Pi_{0}^{i-1}}=P_{\Pi_{i}|{\bf X}\Pi_{0}^{i-1}}$ for $i\in\{1,\dots,r\}\setminus 2\mathbb{Z}$, and $P_{\Pi_{i}|{\bf XY}\Pi_{0}^{i-1}}=P_{\Pi_{i}|{\bf Y}\Pi_{0}^{i-1}}$ for $i\in\{1,\dots,r\}\cap 2\mathbb{Z}$; $\bar{P}_{\bf XY}=P_{\bf X}P_{\bf Y}$ is the distribution under the hypothesis of independence, and $\bar{P}_{\Pi{\bf XY}}:=P_{\Pi|{\bf XY}}\bar{P}_{\bf XY}$. The following result is known in [48, 20, 19]:

Lemma 1.

$D(P_{{\bf Y}\Pi}\|\bar{P}_{{\bf Y}\Pi})\leq I({\bf X};{\bf Y})-I({\bf X};{\bf Y}|\Pi).$

Now by Definition 2, we immediately have

$s_{r}^{*}(X;Y)\geq\frac{I({\bf X};{\bf Y})-I({\bf X};{\bf Y}|\Pi)}{I({\bf XY};\Pi)}$ (21)
$\geq\frac{D(P_{{\bf Y}\Pi}\|\bar{P}_{{\bf Y}\Pi})}{H(\Pi)}$ (22)

which generalizes (11). Therefore, $s_{r}^{*}(X;Y)$ can be used to bound $D(P_{{\bf Y}\Pi}\|\bar{P}_{{\bf Y}\Pi})$, and in turn, the error probability in independence testing.

III Problem Setup and Main Results

We consider estimating the density function at a given point, where the density is assumed to be Hölder continuous in a neighborhood of that point. It is clear that there is no loss of generality in assuming that such a neighborhood is the unit cube and that the given point is its center. More precisely, the class of densities under consideration is defined as follows:

Definition 3.

Given $d\in\{1,2,\dots\}$, $L>0$, $A\in[0,1)$, and $\beta>0$, let $\mathcal{H}(\beta,L,A)$ be the set of all probability densities $p_{XY}$ on $\mathcal{X}\times\mathcal{Y}$ (where $\mathcal{X}=\mathcal{Y}=\mathbb{R}^{d}$) satisfying

$p_{X}(x),p_{Y}(y)\geq A,\quad\forall x,y\in[0,1]^{d},$ (23)

and

$\|p_{XY}\|_{[0,1]^{2d},\beta}\leq L.$ (24)
Definition 4.

We say $\mathcal{C}$ is a prefix code [14] if it is a subset of the set of all finite non-empty binary sequences satisfying the property that for any distinct $s_{1},s_{2}\in\mathcal{C}$, $s_{1}$ cannot be a prefix of $s_{2}$.

The problem is to estimate the density at a given point of an unknown distribution from $\mathcal{H}(\beta,L,A)$. More precisely,

  • $P_{XY}$ is a fixed but unknown distribution whose corresponding density $p_{XY}$ belongs to $\mathcal{H}(\beta,L,A)$ for some $\beta\in(0,\infty)$, $L\in(0,\infty)$, and $A\in[0,1)$.

  • An infinite sequence of pairs $(X(1),Y(1))$, $(X(2),Y(2))$, … is i.i.d. according to $P_{XY}$. Alice (Terminal 1) observes ${\bf X}=(X(l))_{l=1}^{\infty}$ and Bob (Terminal 2) observes ${\bf Y}=(Y(l))_{l=1}^{\infty}$.

  • Unlimited common randomness $\Pi_{0}$ is observed by both Alice and Bob. That is, an infinite random bit string independent of $({\bf X},{\bf Y})$ is shared by Alice and Bob.

  • For $i=1,\dots,r$ ($r$ is an integer), if $i$ is odd, then Alice sends to Bob a message $\Pi_{i}$, which is an element of a prefix code, where $\Pi_{i}$ is computed using the common randomness $\Pi_{0}$, the previous transcripts $\Pi^{i-1}=(\Pi_{1},\dots,\Pi_{i-1})$, and ${\bf X}$; if $i$ is even, then Bob sends to Alice a message $\Pi_{i}$ computed using $\Pi_{0}$, $\Pi^{i-1}$, and ${\bf Y}$.

  • Bob computes an estimate $\hat{p}$ of the true density $p_{XY}(x_{0},y_{0})$, where $x_{0}=y_{0}$ is the center of $[0,1]^{d}$.

One-way NP Estimation Problem. Suppose that $r=1$. Under the constraint on the expected length of the transcript (i.e., the length of the bit string)

$\mathbb{E}[|\Pi^{r}|]\leq k,$ (25)

where $k>0$ is a real number, what is the minimax risk

$R(k):=\min_{\hat{p},\bf\Pi}\max_{p_{XY}\in\mathcal{H}(\beta,L,A)}\mathbb{E}[|\hat{p}-p_{XY}(x_{0},y_{0})|^{2}]\,?$ (26)

Interactive NP Estimation Problem. Under the same constraint on the expected length of the transcript, but without any constraint on the number of rounds $r$, what is the minimax risk?

Remark 1.

The prefix condition ensures that Bob knows that the current round has terminated after finishing reading each $\Pi_{i}$. Alternatively, the problem can be formulated by stating that $\Pi_{i}$ is a random variable in an arbitrary alphabet, and replacing (25) by the entropy constraint $H(\Pi^{r})\leq k$. Furthermore, one may use the information leakage constraint $I({\bf X},{\bf Y};\Pi^{r})\leq k$ instead. From our proofs it is clear that the minimax rates will not change under these alternative formulations.

Remark 2.

There would be no essential difference if the problem were formulated with $|\Pi|\leq k$ almost surely and $|\hat{p}-p_{XY}(x_{0},y_{0})|^{2}\leq R(k)$ with probability (say) at least $1/2$. Indeed, for the upper bound direction, those conditions are satisfied, after a truncation argument, once we have an algorithm satisfying $\mathbb{E}[|\Pi|]\leq k/4$ and $\mathbb{E}[|\hat{p}-p_{XY}(x_{0},y_{0})|^{2}]\leq R(k)/4$, by Markov's inequality and the union bound; therefore the results only differ by a constant factor. For the lower bound, the proof can be extended to the high probability version, since we used a Le Cam style argument [51].

Remark 3.

The common randomness assumption is common in the communication complexity literature, and, in some sense, is equivalent to private randomness [34]. In our upper bound proof, the common randomness is the randomness in the codebooks. Random codebooks give rise to convenient properties, such as the fact that the expectation of the distribution of the matched codewords equals exactly the product of idealized single-letter distributions (82). It is likely, however, that some approximate versions of these proof steps, and ultimately the same asymptotic risk, should hold for some carefully designed deterministic codebooks.

Theorem 1.

In one-way NP estimation, for any $\beta\in(0,\infty)$, $L\in(0,\infty)$, and $A\in[0,1)$,

$R(k)=\Theta\left(\left(\frac{k}{\log k}\right)^{-\frac{2\beta}{d+2\beta}}\right)$ (27)

where $\Theta(\cdot)$ hides multiplicative factors depending on $L$, $\beta$, and $A$.

The proof of the upper bound is in Section VII-B. Recall that nonparametric density estimation using a rectangular kernel is equivalent to counting the frequency of samples in a neighborhood of a given diameter, the bandwidth, which we denote by $\Delta$. A naive protocol is for Alice to send the indices of the samples in $x_{0}+[-\Delta,\Delta]^{d}$. Locating each sample in that neighborhood requires on average $\Theta(\log\frac{1}{\Delta})=\Theta(\log k)$ bits. Thus $\Theta(k/\log k)$ samples in that neighborhood can be located. It turns out that the naive protocol is asymptotically optimal.
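One way to make this bit count concrete (our own illustration; the paper does not prescribe a particular encoding) is to have Alice encode the gaps between consecutive indices falling in the neighborhood with a prefix code such as Elias gamma; since the gaps are of order $\Delta^{-d}$, each located sample costs roughly $2d\log_{2}\frac{1}{2\Delta}+1=\Theta(\log k)$ bits. A minimal sketch with $d=1$ and illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

def elias_gamma_bits(j):
    """Length (in bits) of the Elias gamma codeword of a positive integer j."""
    return 2 * int(np.floor(np.log2(j))) + 1

# Alice marks the indices l with X(l) in the Delta-neighborhood of x0 and encodes the
# gaps between consecutive marked indices; a gap is ~ Delta^{-d} on average, so each
# located sample costs Theta(d * log(1/Delta)) = Theta(log k) bits.
n, delta, x0 = 1_000_000, 0.01, 0.5
x = rng.random(n)                               # X ~ Unif[0,1], d = 1, purely illustrative
hits = np.flatnonzero(np.abs(x - x0) <= delta)  # indices Alice wants to communicate
gaps = np.diff(np.concatenate(([0], hits + 1)))
bits = sum(elias_gamma_bits(g) for g in gaps)
print(bits / len(hits), 2 * np.log2(1 / (2 * delta)) + 1)  # empirical vs rough per-sample cost
```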

The proof of the lower bound (Section VIII) follows by a reduction to testing independence for biased Bernoulli distributions, via a simulation argument. Although some arguments are similar to [19], the present problem concerns biased Bernoulli distributions instead. The (KL) strong data processing constant turns out to be drastically different from the $\chi^{2}$ data processing constant, as opposed to the cases of many familiar distributions such as the unbiased Bernoulli or the Gaussian distributions.

As alluded to, our main result is that the risk can be strictly improved when interactions are allowed:

Theorem 2.

In interactive NP estimation, for any $\beta\in(0,\infty)$, $L\in(0,\infty)$, and $A\in[0,1)$, we have

$R(k)=\Theta\left(k^{-\frac{2\beta}{d+2\beta}}\right)$ (28)

where $\Theta(\cdot)$ hides multiplicative factors depending on $L$, $\beta$, and $A$.

To achieve the scaling in (28), $r$ can grow as slowly as the super-logarithm (i.e., inverse tetration) of $k$; for the precise relation between $r$ and $k$, see Section V-C.

The proof of the upper bound of Theorem 2 is given in Section VII-C, which is based on a novel multi-round estimation scheme for biased Bernoulli distributions formulated and analyzed in Sections IV, V, and VI. Roughly speaking, the intuition is to “locate” the samples within neighborhoods of $(x_{0},y_{0})$ by successive refinements, which is more communication-efficient than revealing the locations all at once.

The lower bound of Theorem 2 is proved in Section IX. The main technical hurdle is to develop new and tighter bounds on the symmetric data processing constant in [32] for the biased binary cases.

IV Estimation of Biased Bernoulli Distributions

In this section, we shall describe an algorithm for estimating the joint distribution of a pair of biased Bernoulli random variables. The biased Bernoulli estimation problem can be viewed as a natural generalization of the Gap-Hamming problem [24][12] to the sparse setting, and it is the key component in both the upper and lower bound analysis for the nonparametric estimation problem. Indeed, we shall explain in Section VII that our nonparametric estimator is based on a linear combination of rectangular kernel estimators, which estimate the probability that $X$ and $Y$ fall into neighborhoods of $x_{0}$ and $y_{0}$. Indicators that samples are within such neighborhoods are Bernoulli variables, so the biased Bernoulli estimator can be used. For the lower bound, we shall explain in Section VIII that the nonparametric estimation problem can be reduced to the biased Bernoulli estimation problem via a simulation argument.

For notational simplicity, we shall use $X,Y$ for the Bernoulli variables in this section as well as in Sections V-VI, although we should keep in mind that these are not the continuous variables in the original nonparametric estimation problem.

Bernoulli Estimation Problem:

  • Fixed real numbers $m_{1},m_{2}\in(10,\infty)$, and an unknown $\delta\in[-1,\min\{m_{1},m_{2}\}-1]$.

  • $({\bf X},{\bf Y})=(X(l),Y(l))_{l=1}^{\infty}$ i.i.d. according to the distribution

    $P^{(\delta)}_{XY}:=\left(\begin{array}{cc}\frac{1}{m_{1}m_{2}}(1+\delta)&\frac{1}{m_{1}}(1-\frac{1}{m_{2}})-\frac{\delta}{m_{1}m_{2}}\\ \frac{1}{m_{2}}(1-\frac{1}{m_{1}})-\frac{\delta}{m_{1}m_{2}}&(1-\frac{1}{m_{1}})(1-\frac{1}{m_{2}})+\frac{\delta}{m_{1}m_{2}}\end{array}\right)$ (31)

    where we recall our convention that the upper left entry of the matrix denotes the probability that $X=Y=0$. Alice observes $(X(l))_{l=1}^{\infty}$ and Bob observes $(Y(l))_{l=1}^{\infty}$.

  • Unlimited common randomness $\Pi_{0}$.

Goal: Alice and Bob exchange messages in no more than $r$ rounds in order to estimate $\delta$.
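For concreteness, here is a short sketch (ours) that builds the joint pmf (31) and draws i.i.d. pairs from it; the values of $m_{1},m_{2},\delta$ and the sample size are arbitrary illustrative choices.

```python
import numpy as np

def bernoulli_joint(delta, m1, m2):
    """Joint pmf (31) of (X, Y); rows indexed by x, columns by y, upper-left = P(X=0, Y=0)."""
    return np.array([
        [(1 + delta) / (m1 * m2),             (1 - 1/m2) / m1 - delta / (m1 * m2)],
        [(1 - 1/m1) / m2 - delta / (m1 * m2), (1 - 1/m1) * (1 - 1/m2) + delta / (m1 * m2)],
    ])

def sample_pairs(n, delta, m1, m2, rng):
    p = bernoulli_joint(delta, m1, m2).ravel()   # order: (0,0), (0,1), (1,0), (1,1)
    idx = rng.choice(4, size=n, p=p)
    return idx // 2, idx % 2                     # X(l), Y(l)

rng = np.random.default_rng(0)
m1 = m2 = 100.0
X, Y = sample_pairs(1_000_000, 0.5, m1, m2, rng)
print(np.mean((X == 0) & (Y == 0)) * m1 * m2 - 1)  # centralized estimate of delta, ~ 0.5
```

Note that $P(X=0,Y=0)=\frac{1+\delta}{m_{1}m_{2}}$, so with all samples in one place $\delta$ is readily estimable from the frequency of common zeros; the protocol below has to accomplish this under a communication budget.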

Our algorithm is described as follows:

Input. $m_{1},m_{2}\in(10,\infty)$; positive integers $n$ and $r$; a sequence of real numbers $\alpha_{1},\dots,\alpha_{r}\in(1,\infty)$ satisfying

$\prod_{1\leq i\leq r}^{\rm odd}\alpha_{i}\leq\frac{m_{1}}{10};$ (32)
$\prod_{1\leq i\leq r}^{\rm even}\alpha_{i}\leq\frac{m_{2}}{10}.$ (33)

The $\alpha_{1},\dots,\alpha_{r}$ can be viewed as parameters of the algorithm, and they control how much information is revealed about the locations of “common zeros” of $({\bf X},{\bf Y})$ in each round of communication. For example, setting $\alpha_{1}=\frac{m_{1}}{10}$ and all other $\alpha_{i}=1$ yields a one-way communication protocol, whereas setting all $\alpha_{i}>1$ yields a “successive refinement” algorithm which may incur a smaller communication budget yet convey the same amount of information.

Before describing the algorithm, let us define a conditional distribution $P_{U^{r}|XY}$ by recursion, which will be used later in generating random codebooks.

Definition 5.

For each $i\in\{1,\dots,r\}\setminus 2\mathbb{Z}$, define

$P_{U_{i}|X=0,U^{i-1}={\bf 0}}=[1,0];$ (34)
$P_{U_{i}|X=1,U^{i-1}={\bf 0}}=[\alpha_{i}^{-1},1-\alpha_{i}^{-1}];$ (35)
$P_{U_{i}|X=0,U^{i-1}\neq{\bf 0}}=P_{U_{i}|X=1,U^{i-1}\neq{\bf 0}}=[0,1].$ (36)

Then set $P_{U_{i}|XYU^{i-1}}=P_{U_{i}|XU^{i-1}}$. For $i=1,\dots,r$ even, we use similar definitions, but with the roles of $X$ and $Y$ switched. This specifies $P_{U_{i}|XYU^{i-1}}$, $i=1,\dots,r$.

Note that by Definition 5, $U_{i}=1$ implies $U_{i+1}=1$ for each $i=1,\dots,r-1$. In words, for $i$ odd, $U_{i}$ marks all $X=0$ as 0, and marks $X=1$ as either 0 or 1; whenever $U_{i}=1$ is marked, then $X$ is definitely 1, and will be forgotten in all subsequent rounds. Now set

$P_{XYU^{r}}^{(\delta)}:=P_{U^{r}|XY}P^{(\delta)}_{XY},$ (37)

where $P_{U^{r}|XY}$ is induced by $(P_{U_{i}|XYU^{i-1}})_{i=1}^{r}$ in Definition 5.

Initialization. By applying a common function to the common randomness, Alice and Bob can produce a shared infinite array $(V_{i,j}(l))$, where $i\in\{1,\dots,r\}$, $j\in\{1,2,\dots\}$, $l\in\{1,2,\dots,n\}$, such that the entries in the array are independent random variables, with $V_{i,j}(l)\sim{\rm Bern}(1-\alpha_{i}^{-1})$. Also set

$U_{0}(l)=0,\quad\forall l=1,\dots,n.$ (38)

Iterations. Consider any $i=1,\dots,r$, where $i$ is odd. We want to generate ${\bf U}_{i}$ by selecting a codeword so that $({\bf X},{\bf Y},{\bf U}^{i})$ follows the distribution $(P_{XYU^{i}}^{(\delta)})^{\otimes n}$, where ${\bf U}^{i-1}$ is defined in previous rounds. Define

$\mathcal{A}_{0}:=\{l\leq n\colon X(l)=0,U_{i-1}(l)=0\};$ (39)
$\mathcal{A}_{1}:=\{l\leq n\colon X(l)=1,U_{i-1}(l)=0\};$ (40)
$\mathcal{A}:=\{l\leq n\colon U_{i-1}(l)=0\}.$ (41)

Note that Alice knows both $\mathcal{A}_{0}$ and $\mathcal{A}_{1}$, while Bob knows $\mathcal{A}$, since it will be seen from the recursion that Alice and Bob both know ${\bf U}_{1},\dots,{\bf U}_{i-1}$ at the beginning of the $i$-th round. Alice chooses $\hat{j}_{i}$ as the minimum positive integer $j$ such that

$V_{i,j}(l)=0,\quad\forall l\in\mathcal{A}_{0}.$ (42)

Alice encodes $\hat{j}_{i}$ using a prefix code, e.g., the Elias gamma code [17], and sends it to Bob. Then both Alice and Bob compute ${\bf U}_{i}=(U_{i}(l))_{l=1}^{n}\in\{0,1\}^{n}$ by

$U_{i}(l):=V_{i,\hat{j}_{i}}(l),\quad\forall l\in\mathcal{A};$ (43)
$U_{i}(l):=1,\quad\forall l\in\{1,\dots,n\}\setminus\mathcal{A}.$ (44)

The operations in the $i$-th round for even $i$ are similar, with the roles of Alice and Bob reversed. We will see later that the notation ${\bf U}_{i}$ is consistent in the sense of (53).
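The following sketch (ours; helper names and parameter values are illustrative) implements one odd-indexed round from Alice's side: scanning codewords $V_{i,j}\sim{\rm Bern}(1-\alpha_{i}^{-1})^{\otimes n}$ until the selection rule (42) is met, Elias gamma encoding $\hat{j}_{i}$, and forming ${\bf U}_{i}$ via (43)-(44). In the actual protocol the $V_{i,j}$ come from the common randomness $\Pi_{0}$, so Bob recovers $V_{i,\hat{j}_{i}}$, and hence ${\bf U}_{i}$, from $\hat{j}_{i}$ alone; here we simply draw them on the fly, which has the same distribution. The small parameters keep the scan short, cf. (86).

```python
import numpy as np

def elias_gamma(j):
    """Elias gamma codeword (as a bit string) of a positive integer j: 2*floor(log2 j) + 1 bits."""
    b = bin(j)[2:]
    return "0" * (len(b) - 1) + b

def odd_round_alice(X, U_prev, alpha_i, rng):
    """One odd round i, Alice's side: pick j_hat by rule (42) and form U_i by (43)-(44)."""
    n = len(X)
    A = np.flatnonzero(U_prev == 0)                     # known to both parties
    A0 = np.flatnonzero((X == 0) & (U_prev == 0))       # known to Alice only
    j_hat = 0
    while True:                                         # scan the shared codewords V_{i,1}, V_{i,2}, ...
        j_hat += 1
        V = rng.binomial(1, 1 - 1 / alpha_i, size=n)    # V_{i,j}(l) ~ Bern(1 - alpha_i^{-1})
        if not V[A0].any():                             # rule (42): V_{i,j}(l) = 0 for all l in A_0
            break
    U_i = np.ones(n, dtype=int)                         # (44): U_i(l) = 1 off A
    U_i[A] = V[A]                                       # (43): U_i(l) = V_{i,j_hat}(l) on A
    return U_i, elias_gamma(j_hat)

rng = np.random.default_rng(0)
n, m1, alpha1, trials = 100, 50.0, 3.0, 200             # small values so the scan stays short
lengths = []
for _ in range(trials):
    X = rng.binomial(1, 1 - 1 / m1, size=n)             # P(X(l) = 0) = 1/m1
    _, msg = odd_round_alice(X, np.zeros(n, dtype=int), alpha1, rng)
    lengths.append(len(msg))
print(np.mean(lengths), 2 * (n / m1) * np.log2(alpha1) + 1)  # average cost vs the bound (89) at delta = 0
```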

Estimator. Recall that in classical parametric statistics, one can evaluate the score function at the sample, compute its expectation and variance, and construct an estimator achieving the Cramér-Rao bound asymptotically. Now for $i\in\{1,\dots,r\}\setminus 2\mathbb{Z}$, define the score function

$\Gamma_{i}(u^{i},y):=\frac{\partial}{\partial\delta}\left.\ln P^{(\delta)}_{U_{i}|YU^{i-1}}(u_{i}|y,u^{i-1})\right|_{\delta=0}$ (45)
$=\begin{cases}\frac{\partial}{\partial\delta}\left.\ln P^{(\delta)}_{U_{i}|YU^{i-1}}(u_{i}|y,{\bf 0})\right|_{\delta=0}&\textrm{if }u^{i-1}={\bf 0}\\ 0&\textrm{otherwise}\end{cases}$ (48)

where $P^{(\delta)}_{U_{i}|YU^{i-1}}$ is induced by $P_{XYU^{r}}^{(\delta)}$. For $i\in\{1,\dots,r\}\cap 2\mathbb{Z}$, define $\Gamma_{i}(u^{i},x)$ similarly with the roles of $X$ and $Y$ reversed. Alice and Bob can each compute

$\Gamma^{\rm A}:=\sum_{1\leq i\leq r}^{\rm even}\sum_{l=1}^{n}\Gamma_{i}(U^{i}(l),X(l))$ (49)

and

$\Gamma^{\rm B}:=\sum_{1\leq i\leq r}^{\rm odd}\sum_{l=1}^{n}\Gamma_{i}(U^{i}(l),Y(l))$ (50)

respectively. Finally, Alice’s and Bob’s estimators are given by

$\hat{\delta}^{\rm A}:=\Gamma^{\rm A}\cdot\left(\partial_{\delta}\mathbb{E}^{(\delta)}[\Gamma^{\rm A}]\right)^{-1};$ (51)
$\hat{\delta}^{\rm B}:=\Gamma^{\rm B}\cdot\left(\partial_{\delta}\mathbb{E}^{(\delta)}[\Gamma^{\rm B}]\right)^{-1},$ (52)

where $\mathbb{E}^{(\delta)}$ refers to the expectation when the true parameter is $\delta$, and $\partial_{\delta}$ denotes the derivative in $\delta$. We will show that these estimators are well-defined: $\partial_{\delta}\mathbb{E}^{(\delta)}[\Gamma^{\rm A}]$ and $\partial_{\delta}\mathbb{E}^{(\delta)}[\Gamma^{\rm B}]$ are independent of $\delta$ (Lemma 3), and can be computed by Alice and Bob without knowing $\delta$.
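As an illustration of (45) and (51)-(52) in the simplest case $r=1$ (where $\Gamma^{\rm B}=\sum_{l}\Gamma_{1}(U_{1}(l),Y(l))$ and, by (99)-(100) in Section VI-B, $\partial_{\delta}\mathbb{E}^{(\delta)}[\Gamma^{\rm B}]=I^{\rm B}$), the following sketch (ours) computes the score table $\Gamma_{1}(u_{1},y)$ and the per-sample normalizer in closed form; the parameter values are arbitrary.

```python
import numpy as np

def joint_pmf(delta, m1, m2):
    """The 2x2 pmf (31); rows are x, columns are y."""
    return np.array([
        [(1 + delta) / (m1 * m2),             (1 - 1/m2) / m1 - delta / (m1 * m2)],
        [(1 - 1/m1) / m2 - delta / (m1 * m2), (1 - 1/m1) * (1 - 1/m2) + delta / (m1 * m2)],
    ])

def score_and_normalizer(m1, m2, alpha1):
    """Score table Gamma_1(u1, y) of (45) and the per-sample normalizer I^B / n, for r = 1."""
    # P_{U1|X} from Definition 5 with U^0 = 0: U1 = 0 if X = 0; U1 ~ Bern(1 - 1/alpha1) if X = 1.
    p_u_given_x = np.array([[1.0, 0.0], [1 / alpha1, 1 - 1 / alpha1]])          # rows x, cols u1
    p_uy0 = p_u_given_x.T @ joint_pmf(0.0, m1, m2)                              # P^{(0)}_{U1 Y}
    dp_uy = p_u_given_x.T @ (np.array([[1.0, -1.0], [-1.0, 1.0]]) / (m1 * m2))  # d/d delta of P_{U1 Y}
    gamma = dp_uy / p_uy0                      # Gamma_1(u1, y); P_Y does not depend on delta
    fisher = float(np.sum(gamma**2 * p_uy0))   # = I^B / n = (1/n) d/d delta E^{(delta)}[Gamma^B]
    return gamma, fisher

gamma, fisher = score_and_normalizer(m1=100.0, m2=100.0, alpha1=10.0)
print(gamma)
print(1 / fisher, 25 * 100.0 * 100.0)  # per-sample variance proxy vs the constant in Corollary 7 ahead
```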

Lemma 2.

For each $i\in\{1,\dots,r\}\setminus 2\mathbb{Z}$, conditioned on ${\bf X},{\bf Y},{\bf U}_{1},\dots,{\bf U}_{i-1}$, we have

${\bf U}_{i}\sim P^{\otimes n}_{U_{i}|XU^{i-1}}(\cdot|{\bf X},{\bf U}_{1},\dots,{\bf U}_{i-1}),$ (53)

where $P_{U_{i}|XU^{i-1}}$ is as defined in (34)-(36). A similar relation holds for even $i$.

Proof.

Immediate from (43)-(44). ∎

Lemma 3.

$\mathbb{E}^{(\delta)}[\Gamma^{\rm A}]$ and $\mathbb{E}^{(\delta)}[\Gamma^{\rm B}]$ are linear in $\delta$.

Proof.

By (53),

$\mathbb{E}^{(\delta)}[\Gamma_{i}(U^{i}(l),Y(l))]$ (54)
$=\sum_{u^{r},x,y}\Gamma_{i}(u^{i},y)P^{(\delta)}_{XY}(x,y)P_{U^{r}|XY}(u^{r}|x,y)$ (55)

for each $i$ odd and $l\in\{1,\dots,n\}$, and similar expressions hold for $i$ even. The claims then follow. ∎

Theorem 3.

$\hat{\delta}^{\rm A}$ and $\hat{\delta}^{\rm B}$ are unbiased estimators.

Proof.

For $i$ odd, by (45) we have $\sum_{u_{i}}\Gamma_{i}(u^{i},y)P^{(0)}_{U_{i}|YU^{i-1}}(u_{i}|y,u^{i-1})=0$ for any $(y,u^{i-1})$. Then $\mathbb{E}^{(0)}[\Gamma_{i}(U^{i}(l),Y(l))]=0$ follows from (55). It follows that $\mathbb{E}^{(0)}[\hat{\delta}^{\rm A}]=\mathbb{E}^{(0)}[\hat{\delta}^{\rm B}]=0$, and unbiasedness is implied by Lemma 3. ∎

V Bounds on Information Exchanges

In this section we prove key auxiliary results that will be used in the upper bounds.

V-A General $(\alpha_{i})$

The following theorem is crucial for the achievability part of the analysis of the Bernoulli estimation problem described in Section IV (and hence for the nonparametric estimation problem). Specifically, (56)-(57) bound the communication from Alice to Bob and from Bob to Alice, and (58)-(59) bound the information exchanged, which, in turn, will bound the estimation risk via Fano's inequality.

Theorem 4.

Consider any $m_{1},m_{2}>10$, $\alpha_{1},\dots,\alpha_{r}\in(1,\infty)$ satisfying (32)-(33), and $P_{U^{r}XY}^{(\delta)}$ as in (37). We have

$\sum_{1\leq i\leq r}^{\rm odd}P^{(0)}_{XU^{i-1}}(0,{\bf 0})\log\alpha_{i}\leq\frac{1.1}{m_{1}}\sum_{1\leq i\leq r}^{\rm odd}\log\alpha_{i}\prod_{2\leq j\leq i-1}^{\rm even}\alpha_{j}^{-1};$ (56)
$\sum_{1\leq i\leq r}^{\rm even}P^{(0)}_{YU^{i-1}}(0,{\bf 0})\log\alpha_{i}\leq\frac{1.1}{m_{2}}\sum_{1\leq i\leq r}^{\rm even}\log\alpha_{i}\prod_{1\leq j\leq i-1}^{\rm odd}\alpha_{j}^{-1},$ (57)

and assuming the natural base of logarithms,

$\lim_{\delta\to 0}\delta^{-2}\sum_{1\leq i\leq r}^{\rm odd}I(U_{i};Y|U^{i-1})\geq\frac{1}{5m_{1}^{2}m_{2}}\prod_{1\leq j\leq r}^{\rm odd}\alpha_{j};$ (58)
$\lim_{\delta\to 0}\delta^{-2}\sum_{1\leq i\leq r}^{\rm even}I(U_{i};X|U^{i-1})\geq\frac{1}{5m_{1}m_{2}^{2}}\prod_{1\leq j\leq r}^{\rm even}\alpha_{j}.$ (59)

The proof can be found in Appendix A.

Remark 4.

Since

$P^{(0)}_{XU^{i-1}}(0,{\bf 0})\log\alpha_{i}$
$=P^{(0)}_{XU^{i-1}}(0,{\bf 0})D(P_{U_{i}|X=0,U^{i-1}={\bf 0}}\|P_{U_{i}|X=1,U^{i-1}={\bf 0}})$ (60)
$\geq P^{(0)}_{U^{i-1}}({\bf 0})\inf_{Q}\left[P^{(0)}_{X|U^{i-1}}(0|{\bf 0})D(P_{U_{i}|X=0,U^{i-1}={\bf 0}}\|Q)+P^{(0)}_{X|U^{i-1}}(1|{\bf 0})D(P_{U_{i}|X=1,U^{i-1}={\bf 0}}\|Q)\right]$ (61)
$=P^{(0)}_{U^{i-1}}({\bf 0})I(U_{i};X|U^{i-1}={\bf 0})$ (62)
$=I(U_{i};X|U^{i-1}),$ (63)

Theorem 4 also implies the following bound on the external information (see (18)):

$I(U^{r};XY)\leq\frac{1.1}{m_{1}}\sum_{1\leq i\leq r}^{\rm odd}\log\alpha_{i}\prod_{2\leq j\leq i-1}^{\rm even}\alpha_{j}^{-1}+\frac{1.1}{m_{2}}\sum_{1\leq i\leq r}^{\rm even}\log\alpha_{i}\prod_{1\leq j\leq i-1}^{\rm odd}\alpha_{j}^{-1}.$ (64)
Remark 5.

Let us provide some intuition for why interaction helps, assuming the case $m_{1}=m_{2}=m$ for simplicity. From the proof of Theorem 4, it can be seen that, up to a constant factor, $s_{\infty}^{*}(X;Y)$ is at least $\frac{\delta^{2}}{m^{3}}\int_{0}^{\ln\frac{m}{100}}e^{t}{\rm d}t\left(\frac{1}{m}\int_{0}^{\ln\frac{m}{100}}e^{-t}{\rm d}t\right)^{-1}\sim\frac{\delta^{2}}{m}$ (which is in fact sharp, as will be seen from the upper bound on $s_{\infty}^{*}(X;Y)$ in Theorem 6). Moreover, lower bounds on $s_{r}^{*}(X;Y)$ can be computed by replacing the integrals with discrete sums with $r$ terms:

$\frac{\delta^{2}}{m^{3}}\sum_{i=1}^{\lceil r/2\rceil}(e^{t_{i}}-e^{t_{i-1}})\left(\frac{1}{m}\sum_{i=1}^{\lceil r/2\rceil}e^{-t_{i-1}}(t_{i}-t_{i-1})\right)^{-1}$ (65)

where $1=t_{0}<t_{1}<\dots<t_{\lceil r/2\rceil}=\ln\frac{m}{100}$. In particular, when $r=1$, we recover $s_{1}^{*}(X;Y)\sim\frac{\delta^{2}}{m\ln m}$, whereas choosing $t_{i}-t_{i-1}=1$, $i=1,\dots,\lceil r/2\rceil$, shows that $r\sim\ln m$ achieves $s_{r}^{*}(X;Y)\sim\frac{\delta^{2}}{m}$. Even better, later we will take $t_{k}$ to be the $k$-th iterated power of 2, and then $r$ will be the super-logarithm of $m$.
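A small numeric sketch (ours) of (65) for $m_{1}=m_{2}=m$, comparing the one-step partition ($r=1$) with a unit-step partition ($r\sim\ln m$ rounds): after multiplying by $m$, the former decays like $1/\ln m$ while the latter stays of constant order, matching the two scalings above.

```python
import math

def lower_bound_65(m, ts, delta=1.0):
    """Evaluate (65) for a partition ts = [t_0, ..., t_{ceil(r/2)}] of [t_0, ln(m/100)]."""
    num = sum(math.exp(b) - math.exp(a) for a, b in zip(ts, ts[1:]))
    den = sum(math.exp(-a) * (b - a) for a, b in zip(ts, ts[1:])) / m
    return (delta ** 2 / m ** 3) * num / den

for m in [1e4, 1e6, 1e8]:
    T = math.log(m / 100)
    one_way = lower_bound_65(m, [1.0, T])                                # r = 1
    many    = lower_bound_65(m, [1.0 + i for i in range(int(T))] + [T])  # r ~ 2 ln m rounds
    print(m, m * one_way, m * many)  # m * one_way shrinks like 1/ln m; m * many stays ~ constant
```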

Recall that the $(\alpha_{i})$ control the amount of information revealed in each round and serve as hyperparameters of the algorithm to be tuned. Next we shall explain how to select the values of the $(\alpha_{i})$ so that the optimal performance is achieved in the one-way and interactive settings.

V-B $r=1$ Case

Specializing Theorem 4 we obtain:

Corollary 4.

For any $m_{1},m_{2}>10$, with $r=1$ and $\alpha_{1}=\frac{m_{1}}{10}$ we have

$P^{(0)}_{X}(0)\log\alpha_{1}\leq\frac{1.1}{m_{1}}\log\frac{m_{1}}{10};$ (66)
$\lim_{\delta\to 0}\delta^{-2}I(U_{1};Y)\geq\frac{1}{50m_{1}m_{2}}.$ (67)

V-C $r=\infty$ Case

Denote by ${}^{n}2$ the $n$-th tetration of 2, which is defined recursively by ${}^{0}2=1$ and

${}^{n}2:=2^{\left({}^{(n-1)}2\right)},\quad\forall n\geq 1.$ (68)

Let $m:=\min\{m_{1},m_{2}\}$, and let $r_{0}$ be the minimum integer such that

$\exp_{e}({}^{r_{0}}2-1)\geq\frac{m}{10}.$ (69)

For $m>10$ we have $r_{0}\geq 1$. Then we set

$r:=2r_{0};$ (70)
$\alpha_{2k-1}:=\alpha_{2k}:=\exp_{e}({}^{k}2-{}^{(k-1)}2),\quad\forall k\in\{1,\dots,r_{0}-1\};$ (71)
$\alpha_{2r_{0}-1}:=\alpha_{2r_{0}}:=\frac{m}{10}\exp_{e}(1-{}^{(r_{0}-1)}2),$ (72)

which fulfills $\alpha_{i}>1$. We see that

$\sum_{1\leq i\leq r}^{\rm odd}\ln\alpha_{i}\prod_{2\leq j\leq i-1}^{\rm even}\alpha_{j}^{-1}$
$\leq\sum_{k=1}^{r_{0}}\left({}^{k}2-{}^{(k-1)}2\right)\exp_{e}\left(1-{}^{(k-1)}2\right)$ (73)
$\leq e\sum_{k=1}^{\infty}{}^{k}2\exp_{e}\left(-{}^{(k-1)}2\right)$ (74)
$=e\sum_{k=1}^{\infty}\exp_{e}\left(-(1-\log 2)\cdot{}^{(k-1)}2\right)$ (75)
$<5.$ (76)

The first inequality above follows from $\alpha_{r-1}=\frac{m}{10}\exp_{e}(1-{}^{(r_{0}-1)}2)\leq\exp_{e}({}^{r_{0}}2-{}^{(r_{0}-1)}2)$. Similarly we also have $\sum_{1\leq i\leq r}^{\rm even}\ln\alpha_{i}\prod_{1\leq j\leq i-1}^{\rm odd}\alpha_{j}^{-1}<5$. Moreover,

$\prod_{1\leq j\leq r}^{\rm odd}\alpha_{j}=\prod_{1\leq j\leq r}^{\rm even}\alpha_{j}=\frac{m}{10}.$ (77)

Summarizing, we have

Corollary 5.

Consider $m_{1},m_{2}>10$, $m:=\min\{m_{1},m_{2}\}$, and $(\alpha_{i})$ defined in (70)-(72). We have

$\sum_{1\leq i\leq r}^{\rm odd}P^{(0)}_{XU^{i-1}}(0,{\bf 0})\ln\alpha_{i}\leq\frac{6}{m_{1}};$ (78)
$\sum_{1\leq i\leq r}^{\rm even}P^{(0)}_{YU^{i-1}}(0,{\bf 0})\ln\alpha_{i}\leq\frac{6}{m_{2}};$ (79)
$\lim_{\delta\to 0}\delta^{-2}\sum_{1\leq i\leq r}^{\rm odd}I(U_{i};Y|U^{i-1})\geq\frac{m}{50m_{1}^{2}m_{2}};$ (80)
$\lim_{\delta\to 0}\delta^{-2}\sum_{1\leq i\leq r}^{\rm even}I(U_{i};X|U^{i-1})\geq\frac{m}{50m_{1}m_{2}^{2}},$ (81)

where $r=2r_{0}$ and $r_{0}$ is defined in (69).

Let us remark that the sequence $(\alpha_{i})$ we used in (70)-(72) is essentially optimal: let $\beta_{k}:=\prod_{2\leq j\leq 2k}^{\rm even}\alpha_{j}^{-1}$. In order for (73) to converge, we need $\sum_{k}\ln(\frac{\beta_{k}}{\beta_{k-1}})\beta_{k-1}^{-1}$ to be convergent. Therefore $\beta_{k}$ cannot grow faster than $\beta_{k}=\exp(\beta_{k-1})$, which is tetration. However, this only amounts to a lower bound on $r$ for the particular design of $P_{U^{r}|XY}$ in Definition 5. Since tetration grows super fast, from a practical viewpoint $r$ is essentially a constant. Nevertheless, it remains an interesting theoretical question whether $r$ needs to diverge:
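A short sketch (ours) that computes $r_{0}$ from (69) and the schedule (70)-(72) for $m_{1}=m_{2}=m$, and numerically checks (76)-(77) for an illustrative $m$:

```python
import math

def tetration2(n):
    """n-th tetration of 2: {}^0 2 = 1, {}^n 2 = 2 ** ({}^{n-1} 2)."""
    t = 1
    for _ in range(n):
        t = 2 ** t
    return t

def alpha_schedule(m):
    """r = 2*r0 and (alpha_1, ..., alpha_r) from (69)-(72), assuming m1 = m2 = m > 10."""
    r0 = 1
    while tetration2(r0) - 1 < math.log(m / 10):           # (69), compared in the log domain
        r0 += 1
    alphas = []
    for k in range(1, r0):
        a = math.exp(tetration2(k) - tetration2(k - 1))    # (71)
        alphas += [a, a]
    a_last = (m / 10) * math.exp(1 - tetration2(r0 - 1))   # (72)
    alphas += [a_last, a_last]
    return 2 * r0, alphas

m = 1e12
r, alphas = alpha_schedule(m)
odd = alphas[0::2]                                         # alpha_1, alpha_3, ...
print(r)                                                   # r = 2*r0 rounds (here 8)
print(math.prod(odd), m / 10)                              # (77): the product over odd i equals m/10
print(sum(math.log(a) / math.prod(odd[:k]) for k, a in enumerate(odd)))  # the sum in (73)-(76), < 5
```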

Conjecture 1.

If there is an algorithm (indexed by $k$) achieving the optimal risk (28) for nonparametric estimation, then necessarily $r\to\infty$ as $k\to\infty$.

VI Achievability Bounds for Bernoulli Estimation

In this section we analyze the performance of the Bernoulli distribution estimation algorithm described in Section IV.

VI-A Communication Complexity

Consider any $i\in\{1,\dots,r\}$. Denoting by $\widehat{P}_{{\bf XY}{\bf U}^{i}}$ the empirical distribution of $(X(l),Y(l),U_{1}(l),\dots,U_{i}(l))_{l=1}^{n}$, we have from (53) that

$\mathbb{E}^{(\delta)}[\widehat{P}_{{\bf XY}{\bf U}^{i}}|{\bf X},{\bf Y},{\bf U}^{i-1}]=\widehat{P}_{{\bf XY}{\bf U}^{i-1}}P_{U_{i}|XU^{i-1}}.$ (82)

In particular,

$\mathbb{E}^{(\delta)}[\widehat{P}_{{\bf XY}{\bf U}^{r}}]=P_{XY}^{(\delta)}P_{U^{r}|XY}.$ (83)

Let $\ell(\hat{j}_{i}):=2\lfloor\log_{2}(\hat{j}_{i})\rfloor+1$ be the number of bits needed to encode the positive integer $\hat{j}_{i}$ using the Elias gamma code [17]. For each $i\in\{1,\dots,r\}\setminus 2\mathbb{Z}$ we have

$\mathbb{E}^{(\delta)}[\ell(\hat{j}_{i})|{\bf X},{\bf Y},{\bf U}^{i-1}]\leq 2\mathbb{E}^{(\delta)}[\log_{2}\hat{j}_{i}|{\bf X},{\bf Y},{\bf U}^{i-1}]+1$ (84)
$\leq 2\log_{2}\mathbb{E}^{(\delta)}[\hat{j}_{i}|{\bf X},{\bf Y},{\bf U}^{i-1}]+1$ (85)
$=2\log_{2}\alpha_{i}^{n\widehat{P}_{{\bf X}{\bf U}_{i-1}}(0,{\bf 0})}+1$ (86)
$=2n\widehat{P}_{{\bf X}{\bf U}_{i-1}}(0,{\bf 0})\log_{2}\alpha_{i}+1,$ (87)

where (86) follows from the selection rule (42) and the formula for the expectation of the geometric distribution. Then

$\mathbb{E}^{(\delta)}[\ell(\hat{j}_{i})]\leq 2nP_{XU_{i-1}}^{(\delta)}(0,{\bf 0})\log_{2}\alpha_{i}+1$ (88)
$\leq 2(1+\delta)nP_{XU_{i-1}}^{(0)}(0,{\bf 0})\log_{2}\alpha_{i}+1,$ (89)

where (88) used (83), and (89) used the fact that $P_{XYU^{r}}^{(\delta)}$ is dominated by $(1+\delta)P_{XYU^{r}}^{(0)}$. Note that ${\bf 0}$ in (88)-(89) denotes the value of the vector $U^{i-1}$.

VI-B Expectation of $\Gamma^{\rm B}$

Recall that $\Gamma^{\rm B}$ was defined in (50). Pick an arbitrary $i\in\{1,\dots,r\}\setminus 2\mathbb{Z}$. Since

$P_{U^{i}Y}^{(\delta)}:=P_{Y}\prod_{j=1}^{i}P^{(\delta)}_{U_{j}|U^{j-1}Y}$ (90)

and since $P_{Y}$ and $(P^{(\delta)}_{U_{j}|U^{j-1}Y})_{j\in\{1,\dots,r\}\cap 2\mathbb{Z}}$ are independent of $\delta$, we obtain

$\partial_{\delta}\ln P_{U^{i}Y}^{(\delta)}(u^{i},y)|_{\delta=0}=\sum_{1\leq j\leq i}^{\rm odd}\Gamma_{j}(u^{j},y).$ (91)

Next, observe that for any $l\in\{1,\dots,n\}$,

$\mathbb{E}^{(0)}[\Gamma_{i}(U^{i}(l),Y(l))|U^{i-1}(l),Y(l)]$
$=\mathbb{E}^{(0)}\left[\left.\frac{\left.\partial_{\delta}P^{(\delta)}_{U_{i}|YU^{i-1}}(U_{i}(l)|Y,U^{i-1}(l))\right|_{\delta=0}}{P^{(0)}_{U_{i}|YU^{i-1}}(U_{i}(l)|Y,U^{i-1}(l))}\right|U^{i-1}(l),Y(l)\right]$ (92)
$=\sum_{u_{i}}\left.\partial_{\delta}P^{(\delta)}_{U_{i}|YU^{i-1}}(u_{i}|Y,U^{i-1}(l))\right|_{\delta=0}$ (93)
$=0.$ (94)

Moreover, for any $\delta\neq 0$,

$\frac{1}{\delta}\sum_{l=1}^{n}\mathbb{E}^{(\delta)}[\Gamma_{i}(U_{1}(l),\dots,U_{i}(l),Y(l))]$
$=\delta^{-1}n\sum_{u^{i},y}\Gamma_{i}(u^{i},y)P^{(\delta)}_{U^{i}Y}(u^{i},y)$ (95)
$=n\sum_{u^{i},y}\Gamma_{i}(u^{i},y)\frac{\partial}{\partial\delta}P_{U^{i}Y}^{(\delta)}(u^{i},y)|_{\delta=0}$ (96)
$=n\sum_{u^{i},y}\Gamma_{i}(u^{i},y)P_{U^{i}Y}^{(0)}(u^{i},y)\sum_{1\leq j\leq i}^{\rm odd}\Gamma_{j}(u^{j},y)$ (97)
$=n\sum_{u^{i},y}\Gamma_{i}^{2}(u^{i},y)P_{U^{i}Y}^{(0)}(u^{i},y),$ (98)

where (95) used (83); (96) used (94) and the linearity of $P^{(\delta)}_{U^{i-1}Y}$ in $\delta$; (97) used (91); (98) follows from (94). Thus

$\frac{1}{\delta}\mathbb{E}^{(\delta)}[\Gamma^{\rm B}]=I^{\rm B},\quad\forall\delta\neq 0,$ (99)

where we defined

$I^{\rm B}:=n\sum_{1\leq i\leq r}^{\rm odd}\mathbb{E}^{(0)}[\Gamma_{i}^{2}(U^{i}(1),Y(1))].$ (100)
Lemma 6.

Let $(U^{i},Y)\sim P^{(\delta)}_{U^{i}Y}$. We have

$I^{\rm B}\geq 2n\lim_{\delta\to 0}\delta^{-2}\sum_{1\leq i\leq r}^{\rm odd}I(U_{i};Y|U^{i-1}),$ (101)

where the logarithmic base of the mutual information is natural.

Proof.

Consider any $i\in\{1,2,\dots,r\}\setminus 2\mathbb{Z}$. We have

$\mathbb{E}^{(0)}\left[\Gamma_{i}^{2}(U^{i},Y)\right]$
$=\mathbb{E}^{(0)}\left[\left(\partial_{\delta}\ln P^{(\delta)}_{U_{i}|YU^{i-1}}(U_{i}|YU^{i-1})|_{\delta=0}\right)^{2}\right]$ (102)
$=2\lim_{\delta\to 0}\delta^{-2}D(P_{U_{i}|YU^{i-1}}^{(\delta)}\|P_{U_{i}|YU^{i-1}}^{(0)}|P_{U^{i-1}Y}^{(0)})$ (103)
$=2\lim_{\delta\to 0}\delta^{-2}D(P_{U_{i}|YU^{i-1}}^{(\delta)}\|P_{U_{i}|U^{i-1}}^{(0)}|P_{U^{i-1}Y}^{(0)})$ (104)
$=2\lim_{\delta\to 0}\delta^{-2}D(P_{U_{i}|YU^{i-1}}^{(\delta)}\|P_{U_{i}|U^{i-1}}^{(0)}|P_{U^{i-1}Y}^{(\delta)})$ (105)
$\geq 2\lim_{\delta\to 0}\delta^{-2}D(P_{U_{i}|YU^{i-1}}^{(\delta)}\|P_{U_{i}|U^{i-1}}^{(\delta)}|P_{U^{i-1}Y}^{(\delta)})$ (106)
$=2\lim_{\delta\to 0}\delta^{-2}I(U_{i};Y|U^{i-1}),$ (107)

where we defined the conditional KL divergence

D(PY|XQY|X|PX):=D(PY|X=xQY|X=x)𝑑PX(x);D(P_{Y|X}\|Q_{Y|X}|P_{X}):=\int D(P_{Y|X=x}\|Q_{Y|X=x})dP_{X}(x);

(104) follows since $P^{(0)}_{U_{i}|YU^{i-1}}=P^{(0)}_{U_{i}|U^{i-1}}$; (105) follows since $\lim_{\delta\to 0}P^{(\delta)}_{U^{i-1}Y}=P^{(0)}_{U^{i-1}Y}$; (106) follows since the conditional divergence $D(P_{U_{i}|YU^{i-1}}^{(\delta)}\|Q_{U_{i}|U^{i-1}}|P_{U^{i-1}Y}^{(\delta)})$ is minimized over $Q_{U_{i}|U^{i-1}}$ by the induced conditional marginal $P^{(\delta)}_{U_{i}|U^{i-1}}$; and (107) holds because the divergence in (106) is exactly the conditional mutual information $I(U_{i};Y|U^{i-1})$ under $P^{(\delta)}$. Summing over odd $i$ and multiplying by $n$ yields (101). ∎

VI-C Variance of ΓB\Gamma^{\rm B}

For any δ\delta, since (𝐔r,𝐗,𝐘)(PUrXY(δ))n({\bf U}^{r},{\bf X,Y})\sim(P^{(\delta)}_{U^{r}XY})^{\otimes n}, we have

var(δ)(ΓB)\displaystyle\operatorname{var}^{(\delta)}(\Gamma^{\rm B}) =l=1nvar(δ)(1iroddΓi(U1(l),,Ui(l),Y(l)))\displaystyle=\sum_{l=1}^{n}\operatorname{var}^{(\delta)}\left(\sum_{1\leq i\leq r}^{\rm odd}\Gamma_{i}(U_{1}(l),\dots,U_{i}(l),Y(l))\right) (108)
=nvar(δ)(1iroddΓi(U1(1),,Ui(1),Y(1))).\displaystyle=n\operatorname{var}^{(\delta)}\left(\sum_{1\leq i\leq r}^{\rm odd}\Gamma_{i}(U_{1}(1),\dots,U_{i}(1),Y(1))\right). (109)

However,

var(δ)(1iroddΓi(U1(1),,Ui(1),Y(1)))\displaystyle\quad\operatorname{var}^{(\delta)}\left(\sum_{1\leq i\leq r}^{\rm odd}\Gamma_{i}(U_{1}(1),\dots,U_{i}(1),Y(1))\right)
𝔼(δ)[(1iroddΓi(U1(1),,Ui(1),Y(1)))2]\displaystyle\leq\mathbb{E}^{(\delta)}\left[\left(\sum_{1\leq i\leq r}^{\rm odd}\Gamma_{i}(U_{1}(1),\dots,U_{i}(1),Y(1))\right)^{2}\right] (110)
(1+δ)𝔼(0)[(1iroddΓi(U1(1),,Ui(1),Y(1)))2]\displaystyle\leq(1+\delta)\mathbb{E}^{(0)}\left[\left(\sum_{1\leq i\leq r}^{\rm odd}\Gamma_{i}(U_{1}(1),\dots,U_{i}(1),Y(1))\right)^{2}\right] (111)
(1+δ)1irodd𝔼(0)[Γi2(U1(1),,Ui(1),Y(1))]\displaystyle\leq(1+\delta)\sum_{1\leq i\leq r}^{\rm odd}\mathbb{E}^{(0)}\left[\Gamma_{i}^{2}(U_{1}(1),\dots,U_{i}(1),Y(1))\right] (112)

where (111) follows since P(δ)P^{(\delta)} is dominated by (1+δ)P(0)(1+\delta)P^{(0)}; (112) used (94). Therefore

var(δ)(ΓB)\displaystyle\quad\operatorname{var}^{(\delta)}(\Gamma^{\rm B})
n(1+δ)1irodd𝔼(0)[Γi2(U1(1),,Ui(1),Y(1))]\displaystyle\leq n(1+\delta)\sum_{1\leq i\leq r}^{\rm odd}\mathbb{E}^{(0)}\left[\Gamma_{i}^{2}(U_{1}(1),\dots,U_{i}(1),Y(1))\right] (113)
=(1+δ)IB.\displaystyle=(1+\delta)I^{\rm B}. (114)

VI-D r=1r=1 Case

We now prove achievability bounds for the Bernoulli distribution estimation algorithm.

Corollary 7.

Given $m_{1},m_{2}>10$, for $r=1$ and $\alpha_{1}:=\frac{m_{1}}{10}$, the mean square error satisfies $\mathbb{E}[|\hat{\delta}^{\rm B}-\delta|^{2}]\leq\frac{25(1+\delta)m_{1}m_{2}}{n}$ and the total communication cost satisfies $\mathbb{E}[|\Pi_{1}|]\leq\frac{2.2(1+\delta)n}{m_{1}}\log_{2}\frac{m_{1}}{10}+1$.

Proof.

We have

𝔼[|δ^Bδ|2]\displaystyle\mathbb{E}[|\hat{\delta}^{\rm B}-\delta|^{2}] =var(δ)(ΓB)(δ𝔼(δ)[ΓB])2\displaystyle=\operatorname{var}^{(\delta)}(\Gamma^{\rm B})\cdot(\partial_{\delta}\mathbb{E}^{(\delta)}[\Gamma^{\rm B}])^{-2} (115)
(1+δ)IB(IB)2\displaystyle\leq(1+\delta)I^{\rm B}\cdot(I^{\rm B})^{-2} (116)
(1+δ)[2nlimδ0δ2I(U1;Y)]1\displaystyle\leq(1+\delta)\left[2n\lim_{\delta\to 0}\delta^{-2}I(U_{1};Y)\right]^{-1} (117)
25(1+δ)m1m2n\displaystyle\leq\frac{25(1+\delta)m_{1}m_{2}}{n} (118)

where (115) follows since δ^B\hat{\delta}^{\rm B} is unbiased (Theorem 3); (116) follows from (99) and (114); (117) follows from Lemma 6; lastly we used Corollary 4.

As for the communication cost

𝔼[|ΠAB|]\displaystyle\mathbb{E}[|\Pi^{\rm A\to B}|] 2(1+δ)n1i1PXUi1(0)(0,0)log2αi+1\displaystyle\leq 2(1+\delta)n\sum_{1\leq i\leq 1}P_{XU_{i-1}}^{(0)}(0,0)\log_{2}\alpha_{i}+1 (119)
2.2(1+δ)nm1log2m110+1\displaystyle\leq\frac{2.2(1+\delta)n}{m_{1}}\log_{2}\frac{m_{1}}{10}+1 (120)

where we used (89) and Corollary 4. ∎
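As an aside (a back-of-the-envelope computation that is not needed later but previews Section VII-B), suppose $m_{1}\asymp m_{2}$ and ignore constant factors and additive lower-order terms. To respect a communication budget of $k$ bits one takes $n\asymp\frac{km_{1}}{\log_{2}m_{1}}$ (cf. (150)), and Corollary 7 then gives

\mathbb{E}[|\hat{\delta}^{\rm B}-\delta|^{2}]\lesssim\frac{m_{1}m_{2}}{n}\asymp\frac{m_{2}\log_{2}m_{1}}{k},

which is the source of the logarithmic factor in the one-way rate of Theorem 1.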

VI-E r=r=\infty Case

Corollary 8.

Let $m_{1},m_{2}>10$ and $m:=\min\{m_{1},m_{2}\}$. For $r$ and $(\alpha_{i})$ defined in Section V-C, the mean square error satisfies $\mathbb{E}[|\hat{\delta}^{\rm B}-\delta|^{2}]\leq\frac{25(1+\delta)m_{1}m_{2}^{2}}{nm}$ and the total communication cost satisfies $\mathbb{E}[|\Pi^{r}|]\leq 6(1+\delta)n(m_{1}^{-1}+m_{2}^{-1})\log_{2}e+\frac{r+1}{2}$.

Proof.

The bound on the mean square error is similar to the r=1r=1 case:

𝔼[|δ^Bδ|2]\displaystyle\mathbb{E}[|\hat{\delta}^{\rm B}-\delta|^{2}] var(δ)(ΓB)(δ𝔼(δ)[ΓB])2\displaystyle\leq\operatorname{var}^{(\delta)}(\Gamma^{\rm B})\cdot(\partial_{\delta}\mathbb{E}^{(\delta)}[\Gamma^{\rm B}])^{-2} (121)
(1+δ)IB(IB)2\displaystyle\leq(1+\delta)I^{\rm B}\cdot(I^{\rm B})^{-2} (122)
(1+δ)[2nlimδ0δ21iroddI(Ui;Y|Ui1)]1\displaystyle\leq(1+\delta)\left[2n\lim_{\delta\to 0}\delta^{-2}\sum_{1\leq i\leq r}^{\rm odd}I(U_{i};Y|U^{i-1})\right]^{-1} (123)
25(1+δ)m12m2mn\displaystyle\leq\frac{25(1+\delta)m_{1}^{2}m_{2}}{mn} (124)

except that we use Corollary 5 in the last step.

For the communication cost,

𝔼[|ΠAB|]\displaystyle\mathbb{E}[|\Pi^{\rm A\to B}|] 2(1+δ)n1iroddPXUi1(0)(0,0)log2αi+r+12\displaystyle\leq 2(1+\delta)n\sum_{1\leq i\leq r}^{\rm odd}P_{XU_{i-1}}^{(0)}(0,0)\log_{2}\alpha_{i}+\frac{r+1}{2} (125)
6(1+δ)n(m11+m21)log2e+r+12\displaystyle\leq 6(1+\delta)n(m_{1}^{-1}+m_{2}^{-1})\log_{2}e+\frac{r+1}{2} (126)

where we used (89) and Corollary 5. ∎
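By the same back-of-the-envelope computation as after Corollary 7 (again only a preview, with $m_{1}\asymp m_{2}\asymp m$ and constant factors ignored), a budget of $k$ bits allows $n\asymp km$ (cf. (155)), and Corollary 8 then gives

\mathbb{E}[|\hat{\delta}^{\rm B}-\delta|^{2}]\lesssim\frac{m_{1}m_{2}^{2}}{nm}\asymp\frac{m^{2}}{n}\asymp\frac{m}{k};

the logarithmic factor of the one-way case disappears, which is ultimately what separates the rates in Theorems 1 and 2.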

VII Density Estimation Upper Bounds

In this section we prove the upper bounds in Theorem 1 and Theorem 2 by building nonparametric density estimators on top of the Bernoulli distribution estimator. For $\beta\in(0,2]$, the rectangular kernel is minimax optimal (Section II-B), so the integral of the density against the kernel can be estimated directly with the Bernoulli distribution estimator; this is explained in Sections VII-B and VII-C. The extension to higher-order kernels, which are constructed as linear combinations of rectangular kernels, is explained in Section VII-D.

VII-A Density Lower Bound Assumption

First, we observe the following simple argument showing that it suffices to consider A>0A>0. Define

B:=supx,y[0,1]dsuppXYpXY(x,y),\displaystyle B:=\sup_{x,y\in[0,1]^{d}}\sup_{p_{XY}}p_{XY}(x,y), (127)

where the inner supremum is over all densities $p_{XY}$ on $\mathbb{R}^{2d}$ satisfying $\|p_{XY}\|_{[0,1]^{2d},\beta}\leq L$. Clearly $B$ is finite, $B>1$, and $B$ depends only on $\beta,L,d$.

Lemma 9.

In either the one-way or the interactive setting, suppose that there exists an algorithm achieving maxpXY(β,L,A)𝔼[|p^pXY(x0,y0)|2]R\max_{p_{XY}\in\mathcal{H}(\beta,L,A)}\mathbb{E}[|\hat{p}-p_{XY}(x_{0},y_{0})|^{2}]\leq R for some R>0R>0 and A(0,1)A\in(0,1). Then, there must be an algorithm achieving maxpXY(β,L,0)𝔼[|p^pXY(x0,y0)|2](1+A1A)2R\max_{p_{XY}\in\mathcal{H}(\beta,L,0)}\mathbb{E}[|\hat{p}-p_{XY}(x_{0},y_{0})|^{2}]\leq\left(\frac{1+A}{1-A}\right)^{2}R.

Proof.

Pick one $p_{XY}$ such that $\|p_{XY}\|_{[0,1]^{2d},\beta}\leq L$ and $\inf_{x,y\in[0,1]^{d}}p_{XY}(x,y)\geq\frac{1+A}{2}>A$; since $\frac{1+A}{2}<1$, such a $p_{XY}$ exists. Consider an arbitrary $q_{XY}\in\mathcal{H}(\beta,L,0)$, and suppose that an infinite sequence of pairs $(X_{1},Y_{1}),\dots$ i.i.d. according to $q_{XY}$ is available to Alice and Bob. Using the common randomness, Alice and Bob can simulate i.i.d. pairs $(\tilde{X}_{1},\tilde{Y}_{1}),\dots$ according to $\tilde{p}_{XY}:=\frac{2A}{1+A}p_{XY}+\frac{1-A}{1+A}q_{XY}$, by replacing each pair, with probability $\frac{2A}{1+A}$, by a fresh pair drawn according to $p_{XY}$. Clearly $\tilde{p}_{XY}\in\mathcal{H}(\beta,L,A)$, and by assumption, $\tilde{p}_{XY}(x_{0},y_{0})$ can be estimated with mean square risk $R$. Since $p_{XY}$ is known, Bob can output $\frac{1+A}{1-A}\big(\hat{p}-\frac{2A}{1+A}p_{XY}(x_{0},y_{0})\big)$, whose error is $\frac{1+A}{1-A}$ times the error of $\hat{p}$; hence $q_{XY}(x_{0},y_{0})$ can be estimated with mean square risk $\left(\frac{1+A}{1-A}\right)^{2}R$. ∎

For the rest of the section, we will assume that there is a density lower bound A>0A>0 and pXY(β,L,A)p_{XY}\in\mathcal{H}(\beta,L,A), which is sufficient in view of Lemma 9. Consider bandwidth h>0h>0 (which will be specified later as an inverse polynomial of kk). Also introduce the notations

𝒜\displaystyle\mathcal{A} :=x0+h[1,1]d;\displaystyle:=x_{0}+h[-1,1]^{d}; (128)
\displaystyle\mathcal{B} :=y0+h[1,1]d.\displaystyle:=y_{0}+h[-1,1]^{d}. (129)

Define PX¯Y¯P_{\underline{X}\underline{Y}} as the probability distribution induced by PXYP_{XY} and with

X¯\displaystyle\underline{X} :=1{X𝒜};\displaystyle:=1\{X\notin\mathcal{A}\}; (130)
Y¯\displaystyle\underline{Y} :=1{Y}.\displaystyle:=1\{Y\notin\mathcal{B}\}. (131)

Define

m1\displaystyle m_{1} :=1(X¯=0);\displaystyle:=\mathbb{P}^{-1}(\underline{X}=0); (132)
m2\displaystyle m_{2} :=1(Y¯=0).\displaystyle:=\mathbb{P}^{-1}(\underline{Y}=0). (133)

Note that $m_{1}/m_{2}$ is bounded above and below by positive constants depending on $A$, $\beta$, and $L$ (see (138) and (143)). Also, we can assume that Alice and Bob both know $m_{1}$ and $m_{2}$: with infinite samples Alice and Bob know their marginal densities $p_{X}$ and $p_{Y}$, and Alice can send $m_{1}$ to Bob to very high precision using a negligible number of bits. Let $\delta\geq-1$ be the number such that $P_{\underline{X}\underline{Y}}$ is the matrix

\displaystyle\left(\begin{array}[]{cc}\frac{1}{m_{1}m_{2}}(1+\delta)&\frac{1}{m_{1}}(1-\frac{1}{m_{2}})-\frac{\delta}{m_{1}m_{2}}\\ \frac{1}{m_{2}}(1-\frac{1}{m_{1}})-\frac{\delta}{m_{1}m_{2}}&(1-\frac{1}{m_{1}})(1-\frac{1}{m_{2}})+\frac{\delta}{m_{1}m_{2}}\end{array}\right). (136)

Let δ^B\hat{\delta}^{\rm B} be Bob’s estimator of δ\delta in (52). Then we define Bob’s density estimator:

p^B:=1+δ^Bm1m2h2d.\displaystyle\hat{p}^{\rm B}:=\frac{1+\hat{\delta}^{\rm B}}{m_{1}m_{2}h^{2d}}. (137)
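The following minimal, centralized sketch illustrates the plug-in form of (137); it is for intuition only. The paper's $\hat{\delta}^{\rm B}$ is Bob's interactive estimator from (52), whereas the sketch substitutes the naive empirical plug-in; the generating density, the point $(x_{0},y_{0})$, and the cell size (cells of volume $h^{d}$ around $x_{0}$ and $y_{0}$) are illustrative assumptions.

```python
import numpy as np

# Centralized illustration of the plug-in form of (137); for intuition only.
# delta-hat^B in the paper is Bob's interactive estimator from (52); here we use
# the naive empirical plug-in instead, and take the cells around x0 and y0 to
# have volume h^d each (constant factors in the cell volume only affect constants,
# not the rates). The density below is a hypothetical smooth choice on [0,1]^2.
rng = np.random.default_rng(1)
d, h, n = 1, 0.05, 200_000
x0, y0 = 0.4, 0.6
p = lambda x, y: 1 + 0.5 * np.sin(2*np.pi*x) * np.sin(2*np.pi*y)

cand = rng.random((3 * n, 3))                      # rejection sampling from p (max = 1.5)
keep = cand[:, 2] < p(cand[:, 0], cand[:, 1]) / 1.5
X, Y = cand[keep, 0][:n], cand[keep, 1][:n]

in_A = np.abs(X - x0) <= h / 2                     # Xbar = 0  <=>  X in the cell around x0
in_B = np.abs(Y - y0) <= h / 2                     # Ybar = 0  <=>  Y in the cell around y0
m1, m2 = 1 / in_A.mean(), 1 / in_B.mean()          # empirical analogues of (132)-(133)
delta_hat = m1 * m2 * (in_A & in_B).mean() - 1     # plug-in for delta in (136)
p_hat = (1 + delta_hat) / (m1 * m2 * h**(2 * d))   # density estimate, formula (137)
print(p_hat, p(x0, y0))                            # close to the true value for large n
```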

We next show that the smoothness of the density ensures that $1+\delta$ is at most of constant order. Recall that $A$ lower bounds $p_{XY}$ on $[0,1]^{2d}$, hence also the marginals $p_{X}$ and $p_{Y}$ on $[0,1]^{d}$. Define $M:=\max\{m_{1},m_{2}\}$ and $m:=\min\{m_{1},m_{2}\}$. The definition of $(m_{1},m_{2})$ implies $Ah^{d}\leq\frac{1}{M}$, and hence

h(1AM)1/d.\displaystyle h\leq\left(\frac{1}{AM}\right)^{1/d}. (138)

Recall BB defined in (127). We then have

1+δ\displaystyle 1+\delta =m1m2PXY(𝒜×)\displaystyle=m_{1}m_{2}P_{XY}(\mathcal{A}\times\mathcal{B}) (139)
m1m2Bh2d\displaystyle\leq m_{1}m_{2}Bh^{2d} (140)
Bm1m2A2m2\displaystyle\leq\frac{Bm_{1}m_{2}}{A^{2}m^{2}} (141)
=BMA2m.\displaystyle=\frac{BM}{A^{2}m}. (142)

Next, observe that $1=m_{1}P_{X}(\mathcal{A})\leq m_{1}Bh^{d}$, which yields $h^{d}\geq\frac{1}{m_{1}B}$. Similarly we also have $h^{d}\geq\frac{1}{m_{2}B}$, therefore

hd1mB\displaystyle h^{d}\geq\frac{1}{mB} (143)

Together with (138), we see that hd=Θ(1/m)=Θ(1/M)h^{d}=\Theta(1/m)=\Theta(1/M).

Next, the bias of the density estimator is

𝔼[p^B]pXY(x0,y0)=PXY(𝒜×)h2dpXY(x0,y0)\displaystyle\mathbb{E}[\hat{p}^{\rm B}]-p_{XY}(x_{0},y_{0})=\frac{P_{XY}(\mathcal{A}\times\mathcal{B})}{h^{2d}}-p_{XY}(x_{0},y_{0}) (144)

which is just the bias of the rectangular kernel estimator (with bandwidth $h$ in each of the two $d$-dimensional subspaces). The rectangular kernel is a compactly supported kernel of order 1 [44, Definition 1.3], while by assumption $\beta\in(0,2]$; therefore the bias is bounded as ([44, Proposition 1.2])

|𝔼[p^B]pXY(x0,y0)|Chβ\displaystyle|\mathbb{E}[\hat{p}^{\rm B}]-p_{XY}(x_{0},y_{0})|\leq Ch^{\beta} (145)

where CC is a constant depending only on β,d\beta,d and LL.

VII-B One-Way Case

By Corollary 7 and (143), we can bound the variance of the density estimator as

var(p^B)\displaystyle\operatorname{var}(\hat{p}^{\rm B}) =1m12m22h4dvar(δ^B)\displaystyle=\frac{1}{m_{1}^{2}m_{2}^{2}h^{4d}}\operatorname{var}(\hat{\delta}^{\rm B}) (146)
1m12m22h4d25(1+δ)m1m2n\displaystyle\leq\frac{1}{m_{1}^{2}m_{2}^{2}h^{4d}}\cdot\frac{25(1+\delta)m_{1}m_{2}}{n} (147)
25(1+δ)B4m3nM\displaystyle\leq\frac{25(1+\delta)B^{4}m^{3}}{nM} (148)

where (148) used (143). Also by Corollary 7, the communication constraint is satisfied if the following holds

2.2(1+δ)nm1log2m110+1k.\displaystyle\frac{2.2(1+\delta)n}{m_{1}}\log_{2}\frac{m_{1}}{10}+1\leq k. (149)

Now we can choose $h$ so that $m_{1}$, as defined by (132), equals $(\frac{k}{\log_{2}k})^{\frac{d}{d+2\beta}}$, and set

n\displaystyle n =(2.2(1+δmax)m1log2m110)1(k1)\displaystyle=\left\lfloor\left(\frac{2.2(1+\delta_{\rm max})}{m_{1}}\log_{2}\frac{m_{1}}{10}\right)^{-1}(k-1)\right\rfloor (150)

where $\delta_{\rm max}$ is an upper bound on the true parameter $\delta$, obtained from (142) together with the fact that $M/m$ is bounded by a constant (see the remark below (133)); hence $\delta_{\rm max}$ depends only on $(A,\beta,L)$. Then the communication constraint is satisfied. Moreover, by the bias bound (145) and the variance bound (148), the risk is bounded by

|𝔼[p^B]pXY(x0,y0)|2+var(p^B)\displaystyle\quad|\mathbb{E}[\hat{p}^{\rm B}]-p_{XY}(x_{0},y_{0})|^{2}+\operatorname{var}(\hat{p}^{\rm B})
C2h2β+25(1+δ)B4m3nM\displaystyle\leq C^{2}h^{2\beta}+\frac{25(1+\delta)B^{4}m^{3}}{nM} (151)
\displaystyle\leq C^{2}(Am)^{-2\beta/d}+\frac{25(1+\delta)B^{4}m^{3}}{nM} (152)
D(klogk)2βd+2β\displaystyle\leq D(\frac{k}{\log k})^{-\frac{2\beta}{d+2\beta}} (153)

where $D$ is a constant depending only on $\beta$, $L$, and $A$; here we used that $\delta$ is bounded above via (142), the bounds (138) and (143) on $h$ (so that $m\asymp M\asymp m_{1}$), and the choices of $m_{1}$ and $n$. This proves the upper bound in Theorem 1 for $\beta\in(0,2]$.
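For concreteness (a purely illustrative instantiation of the above choices), take $d=1$ and $\beta=1$: then $m_{1}=(k/\log_{2}k)^{1/3}$, $h\asymp m_{1}^{-1}$, $n\asymp km_{1}/\log_{2}m_{1}$, and (153) becomes

\mathbb{E}\big[|\hat{p}^{\rm B}-p_{XY}(x_{0},y_{0})|^{2}\big]\leq D\Big(\frac{k}{\log k}\Big)^{-2/3},

to be compared with the order $k^{-2/3}$ achieved by the interactive scheme of the next subsection.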

VII-C Interactive Case

Choose $h$ such that $m:=\min\{m_{1},m_{2}\}$, with $m_{1},m_{2}$ defined by (132)-(133), satisfies

\displaystyle m=k^{\frac{d}{d+2\beta}}, (154)

and set

n:=mkln213(1+δmax)\displaystyle n:=\left\lfloor\frac{mk\ln 2}{13(1+\delta_{\rm max})}\right\rfloor (155)

where, as before, $\delta_{\rm max}$ is an upper bound on $\delta$ that depends only on $(A,\beta,L)$. For $k$ large enough, the choice (154) gives $m>10$, and by Corollary 8 and the choice (155) of $n$, the communication cost is bounded by $k$. Moreover, arguing as in (151)-(153) but with Corollary 8 in place of Corollary 7, the risk is bounded by $Dk^{-\frac{2\beta}{d+2\beta}}$ for some $D$ depending only on $\beta$, $L$, and $A$. This proves the upper bound in Theorem 2 when $\beta\in(0,2]$.

VII-D Extension to β>2\beta>2

For β>2\beta>2, the rectangular kernel is no longer minimax optimal. However, observe the following:

Proposition 10.

For any positive integers $d$ and $l$, there exists an order-$l$ kernel in $\mathbb{R}^{d}$ that is a linear combination of $(\lfloor l/2\rfloor+1)^{d}$ indicator functions.

Proof.

In the following we prove the case $d=1$; the case of general $d$ then follows by taking tensor products of kernel functions on $\mathbb{R}$. Note that an order-$l$ kernel must satisfy the following equations:

K(u)𝑑u\displaystyle\int K(u)du =1;\displaystyle=1; (156)
ujK(u)𝑑u\displaystyle\int u^{j}K(u)du =0,j=1,,l.\displaystyle=0,\quad j=1,\dots,l. (157)

Let us consider $K$ of the following form:

K(u)=k=1k0ck1[k,k]\displaystyle K(u)=\sum_{k=1}^{k_{0}}c_{k}1_{[-k,k]} (158)

where $k_{0}:=\lfloor l/2\rfloor+1$. Since such a $K(u)$ is an even function, the conditions in (157) with odd $j$ hold automatically, and (156)-(157) yield $k_{0}$ nontrivial equations for $c_{1},\dots,c_{k_{0}}$ (namely (156) and the conditions in (157) with even $j$):

2k=1k0kck=1;\displaystyle 2\sum_{k=1}^{k_{0}}kc_{k}=1; (159)
k=1k02kj+1j+1ck\displaystyle\sum_{k=1}^{k_{0}}\frac{2k^{j+1}}{j+1}c_{k} =0,j{1,,l}2.\displaystyle=0,\quad j\in\{1,\dots,l\}\cap 2\mathbb{Z}. (160)

After factoring $2k$ out of the column indexed by $k$ and $\frac{1}{j+1}$ out of each row, the coefficient matrix of (159)-(160) becomes a Vandermonde matrix in the distinct nodes $1^{2},2^{2},\dots,k_{0}^{2}$; hence, by the formula for the Vandermonde determinant, these equations have a unique solution for $c_{1},\dots,c_{k_{0}}$. ∎
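The construction in Proposition 10 is easy to carry out explicitly; the following sketch (with an arbitrary illustrative choice of the order $l$) solves (159)-(160) numerically and checks the moment conditions of the resulting kernel.

```python
import numpy as np

# Construction of Proposition 10 for d = 1: solve (159)-(160) for the coefficients
# c_1, ..., c_{k0} of K(u) = sum_k c_k 1_{[-k,k]}(u), then verify numerically that
# K integrates to 1 and that its moments of order 1, ..., l vanish.
l = 5                                     # kernel order; an illustrative choice
k0 = l // 2 + 1
ks = np.arange(1, k0 + 1)
rows, rhs = [2.0 * ks], [1.0]             # normalization (159): 2 * sum_k k*c_k = 1
for j in range(2, l + 1, 2):              # even j in {1, ..., l}, cf. (160)
    rows.append(2.0 * ks**(j + 1) / (j + 1))
    rhs.append(0.0)
c = np.linalg.solve(np.array(rows), np.array(rhs))

u = np.linspace(-k0, k0, 2_000_001)
du = u[1] - u[0]
K = sum(ck * (np.abs(u) <= k) for ck, k in zip(c, ks))
for j in range(l + 1):                    # moment j = 0 gives 1; j = 1, ..., l give ~0
    print(j, (u**j * K).sum() * du)
```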

Now for general $\beta>0$, we can take an order-$l$ kernel with $l=\lfloor\beta\rfloor$ as in Proposition 10. We can estimate $\frac{1}{h^{2d}}\int p_{XY}(x,y)K(\frac{(x,y)-(x_{0},y_{0})}{h})\,dx\,dy$ by applying the Bernoulli distribution estimator $(\lfloor l/2\rfloor+1)^{2d}$ times, once for each indicator in the linear combination. Therefore, by arguments similar to those in the preceding subsections, the upper bounds in Theorem 1 and Theorem 2 hold for $\beta>2$ as well.

VIII One-way Density Estimation Lower Bound

VIII-A Upper Bounding s(X;Y)s^{*}(X;Y)

The pointwise estimation lower bound is obtained by lower bounding the risk for estimating PX¯Y¯P_{\underline{X}\underline{Y}} (with X¯\underline{X} and Y¯\underline{Y} being indicators of neighborhoods of x0x_{0} and y0y_{0}), and applying Le Cam’s inequality to the latter. Therefore we are led to considering the strong data processing constant for the biased Bernoulli distribution.

Theorem 5.

Let $P_{XY}^{(\delta)}$ be as defined in (31), where $\delta\in(-1,1)$ and $m>1$. Then $s^{*}(X;Y)\leq\frac{\delta^{2}}{m\ln m-m+1}$.

Proof.

For this proof we can assume without loss of generality that the logarithms are natural. For any QXQ_{X}, let QYQ_{Y} be the output through the channel PY|XP_{Y|X}. Then

s(X;Y)\displaystyle s^{*}(X;Y) D(QYPY)D(QXPX)\displaystyle\leq\frac{D(Q_{Y}\|P_{Y})}{D(Q_{X}\|P_{X})} (161)
Dχ2(QYPY)Dχ2(QXPX)Dχ2(QXPX)D(QXPX)\displaystyle\leq\frac{D_{\chi^{2}}(Q_{Y}\|P_{Y})}{D_{\chi^{2}}(Q_{X}\|P_{X})}\cdot\frac{D_{\chi^{2}}(Q_{X}\|P_{X})}{D(Q_{X}\|P_{X})} (162)
δ2mlnmm+1\displaystyle\leq\frac{\delta^{2}}{m\ln m-m+1} (163)

where we defined the $\chi^{2}$ divergence in (7). The justification of the steps is as follows: (161) is well known; (162) follows since the $\chi^{2}$ divergence dominates the KL divergence (see e.g. [39]). To see (163), note that $\frac{D_{\chi^{2}}(Q_{Y}\|P_{Y})}{D_{\chi^{2}}(Q_{X}\|P_{X})}$ is upper bounded by $\rho_{\sf m}^{2}(X,Y)$, the square of the maximal correlation coefficient (see e.g. [5, 6]). Viewing $\rho_{\sf m}(X,Y)$ as the norm of a linear operator, it can be shown to equal the second largest singular value of

𝐌\displaystyle{\bf M} :=(1PX(x)PXY(x,y)1PY(y))x,y\displaystyle:=\left(\frac{1}{\sqrt{P_{X}(x)}}P_{XY}(x,y)\frac{1}{\sqrt{P_{Y}(y)}}\right)_{x,y} (164)
=[1+δm11m+δm(m1)];\displaystyle=\begin{bmatrix}\frac{1+\delta}{m}&*\\ *&1-\frac{1}{m}+\frac{\delta}{m(m-1)}\end{bmatrix}; (165)

see e.g. [6]. Since ${\bf M}$ is a symmetric matrix, its singular values are the absolute values of its eigenvalues. The largest eigenvalue of ${\bf M}$ is 1, corresponding to the eigenvector $(\sqrt{P_{X}(0)},\sqrt{P_{X}(1)})$ (which is evident from (164)), whereas the trace

tr(𝐌)=1+δm+11m+δm(m1)=1+δm1\displaystyle\operatorname{tr}({\bf M})=\frac{1+\delta}{m}+1-\frac{1}{m}+\frac{\delta}{m(m-1)}=1+\frac{\delta}{m-1} (166)

which is evident from (165). Therefore the second eigenvalue of ${\bf M}$ is $\frac{\delta}{m-1}$, and $\rho_{\sf m}^{2}(X;Y)=\frac{\delta^{2}}{(m-1)^{2}}$. Moreover, since the $\chi^{2}$ and KL divergences are both $f$-divergences, their ratio can be bounded by the ratio of their corresponding $f$-functions (see e.g. [39]):

Dχ2(QXPX)D(QXPX)\displaystyle\frac{D_{\chi^{2}}(Q_{X}\|P_{X})}{D(Q_{X}\|P_{X})} sup0<tm(t1)2tlntt+1\displaystyle\leq\sup_{0<t\leq m}\frac{(t-1)^{2}}{t\ln t-t+1} (167)
=(m1)2mlnmm+1;\displaystyle=\frac{(m-1)^{2}}{m\ln m-m+1}; (168)

The constraint tmt\leq m in (167) is because minxPX(x)=1m\min_{x}P_{X}(x)=\frac{1}{m} and maxxQX(x)PX(x)m\max_{x}\frac{Q_{X}(x)}{P_{X}(x)}\leq m. To show (168), it suffices to show that

infu(1,m1](1+u)ln(1+u)uu2\inf_{u\in(-1,m-1]}\frac{(1+u)\ln(1+u)-u}{u^{2}}

is achieved at u=m1u=m-1. For this, it suffices to show that the derivative of the objective function, (2+u)ln(1+u)+2uu3\frac{-(2+u)\ln(1+u)+2u}{u^{3}} is negative on (1,m1](-1,m-1]. Indeed, define ϕ(u):=(2+u)ln(1+u)2u\phi(u):=(2+u)\ln(1+u)-2u. We can check that ϕ(0)=0\phi(0)=0, ϕ(0)=0\phi^{\prime}(0)=0, and ϕ′′(u)=u(1+u)2\phi^{\prime\prime}(u)=\frac{u}{(1+u)^{2}}, which imply that ϕ(u)>0\phi(u)>0 for u>0u>0 and ϕ(u)<0\phi(u)<0 for u<0u<0. Therefore (168), and hence (163), holds. ∎
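The linear-algebra step in the proof is easy to verify numerically; the following sketch (with arbitrary illustrative values of $m$ and $\delta$, and with $P_{XY}^{(\delta)}$ parametrized as in (171) with $m_{1}=m_{2}=m$) checks that the second largest singular value of ${\bf M}$ is indeed $|\delta|/(m-1)$.

```python
import numpy as np

# Check of the maximal-correlation computation in the proof of Theorem 5:
# the second largest singular value of M in (164) equals |delta| / (m - 1).
m, delta = 7.0, 0.3                      # arbitrary test values with m > 1, |delta| < 1
p = 1.0 / m
P = np.array([[p*p*(1 + delta),          p*(1 - p) - p*p*delta],
              [p*(1 - p) - p*p*delta,    (1 - p)**2 + p*p*delta]])   # cf. (171)
Dx = np.diag(1 / np.sqrt(P.sum(axis=1)))  # diag(1/sqrt(P_X))
Dy = np.diag(1 / np.sqrt(P.sum(axis=0)))  # diag(1/sqrt(P_Y))
M = Dx @ P @ Dy                           # the matrix in (164)
s = np.linalg.svd(M, compute_uv=False)    # singular values, largest first
print(s[0], s[1], abs(delta) / (m - 1))   # s[0] ~ 1 and s[1] = |delta|/(m-1)
```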

VIII-B Lower Bounding One-Way NP Estimation Risk

Given k,d,β,L,Ak,d,\beta,L,A, consider a distribution PX¯Y¯P_{\underline{X}\underline{Y}} on {0,1}2\{0,1\}^{2} with matrix

(1m2(1+δ)1m(11m)δm21m(11m)δm2(11m)2+δm2),\displaystyle\left(\begin{array}[]{cc}\frac{1}{m^{2}}(1+\delta)&\frac{1}{m}(1-\frac{1}{m})-\frac{\delta}{m^{2}}\\ \frac{1}{m}(1-\frac{1}{m})-\frac{\delta}{m^{2}}&(1-\frac{1}{m})^{2}+\frac{\delta}{m^{2}}\end{array}\right), (171)

where m:=(aklnk)d2β+dm:=\left(\frac{ak}{\ln k}\right)^{\frac{d}{2\beta+d}} and δ:=mβd\delta:=m^{-\frac{\beta}{d}}, with a:=16β+8ddln2a:=\frac{16\beta+8d}{d}\ln 2 being a constant. We then need to “simulate” smooth distributions from PX¯Y¯P_{\underline{X}\underline{Y}}. Let f:d[0,)f\colon\mathbb{R}^{d}\to[0,\infty) be a function satisfying the following properties:

  • ff has a compact support;

  • df=1\int_{\mathbb{R}^{d}}f=1;

  • f(0)>0f(0)>0;

  • f(x)[0,1]f(x)\in[0,1], for all xdx\in\mathbb{R}^{d};

  • fd,β<L4\|f\|_{\mathbb{R}^{d},\beta}<\frac{L}{4};

  • Define g(x,y)=f(x)f(y)g(x,y)=f(x)f(y) as a function on 2d\mathbb{R}^{2d}. Then g2d,β<L4\|g\|_{\mathbb{R}^{2d},\beta}<\frac{L}{4}.

Clearly, such a function exists for any given $\beta,L,d$. For $m$ sufficiently large that $m^{-1/d}\operatorname{supp}(f)+x_{0}\subseteq[0,1]^{d}$ and $m^{-1/d}\operatorname{supp}(f)+y_{0}\subseteq[0,1]^{d}$ (recall that $(x_{0},y_{0})$ is the given point in the density estimation problem), define

pX|X¯=0(x):=1PX¯(0)f(m1d(xx0)),xd;\displaystyle p_{X|\underline{X}=0}(x):=\frac{1}{P_{\underline{X}}(0)}f(m^{\frac{1}{d}}(x-x_{0})),\quad\forall x\in\mathbb{R}^{d}; (172)
pY|Y¯=0(y):=1PY¯(0)f(m1d(yy0)),yd.\displaystyle p_{Y|\underline{Y}=0}(y):=\frac{1}{P_{\underline{Y}}(0)}f(m^{\frac{1}{d}}(y-y_{0})),\quad\forall y\in\mathbb{R}^{d}. (173)

Since PX¯(0)=PY¯(0)=1mP_{\underline{X}}(0)=P_{\underline{Y}}(0)=\frac{1}{m}, clearly the above define valid probability densities supported on [0,1]d[0,1]^{d}. Define

pX|X¯=1(x):=1{x[0,1]d}PX¯(1)[1f(m1/d(xx0))];\displaystyle p_{X|\underline{X}=1}(x):=\frac{1\{x\in[0,1]^{d}\}}{P_{\underline{X}}(1)}[1-f(m^{1/d}(x-x_{0}))]; (174)
pY|Y¯=1(y):=1{y[0,1]d}PY¯(1)[1f(m1/d(yy0))],\displaystyle p_{Y|\underline{Y}=1}(y):=\frac{1\{y\in[0,1]^{d}\}}{P_{\underline{Y}}(1)}[1-f(m^{1/d}(y-y_{0}))], (175)

which are also probability densities supported on [0,1]d[0,1]^{d}. Define PXY|XY¯=PX|X¯PY|Y¯P_{XY|\underline{XY}}=P_{X|\underline{X}}P_{Y|\underline{Y}}, where PX|X¯P_{X|\underline{X}} and PY|Y¯P_{Y|\underline{Y}} are conditional distributions defined by the densities above. Under the joint distribution PXYXY¯P_{XY\underline{XY}}, we have

pX(x)=pY(y)=1,x,y[0,1]d.\displaystyle p_{X}(x)=p_{Y}(y)=1,\quad\forall x,y\in[0,1]^{d}. (176)

Define

P¯XYXY¯=PX|X¯PY|Y¯PX¯PY¯.\displaystyle\bar{P}_{XY\underline{XY}}=P_{X|\underline{X}}P_{Y|\underline{Y}}P_{\underline{X}}P_{\underline{Y}}. (177)

We now check that the density of PXYP_{XY} satisfies pXY(0,1)2d,βL\|p_{XY}\|_{(0,1)^{2d},\beta}\leq L for mm sufficiently large. Indeed, for x,y[0,1]dx,y\in[0,1]^{d},

pXY(x,y)\displaystyle\quad p_{XY}(x,y)
=i,j=0,1pXY|XY¯=(i,j)(x,y)PXY¯(i,j)\displaystyle=\sum_{i,j=0,1}p_{XY|\underline{XY}=(i,j)}(x,y)P_{\underline{XY}}(i,j) (178)
=i,j=0,1pXY|XY¯=(i,j)(x,y)P¯XY¯(i,j)\displaystyle=\sum_{i,j=0,1}p_{XY|\underline{XY}=(i,j)}(x,y)\bar{P}_{\underline{XY}}(i,j)
+δm2(pX|X¯=0(x)pX|X¯=1(x))(pY|Y¯=0(y)pY|Y¯=1(y))\displaystyle\quad+\tfrac{\delta}{m^{2}}(p_{X|\underline{X}=0}(x)-p_{X|\underline{X}=1}(x))(p_{Y|\underline{Y}=0}(y)-p_{Y|\underline{Y}=1}(y)) (179)
=1+δm2(pX|X¯=0(x)pX|X¯=1(x))\displaystyle=1+\tfrac{\delta}{m^{2}}(p_{X|\underline{X}=0}(x)-p_{X|\underline{X}=1}(x))
(pY|Y¯=0(y)pY|Y¯=1(y))\displaystyle\quad\cdot(p_{Y|\underline{Y}=0}(y)-p_{Y|\underline{Y}=1}(y)) (180)
=1+δ[1m1+111mf(m1/d(xx0))]\displaystyle=1+\delta\left[-\tfrac{1}{m-1}+\tfrac{1}{1-\tfrac{1}{m}}f(m^{1/d}(x-x_{0}))\right]
[1m1+111mf(m1/d(yy0))]\displaystyle\quad\cdot\left[-\tfrac{1}{m-1}+\tfrac{1}{1-\tfrac{1}{m}}f(m^{1/d}(y-y_{0}))\right] (181)
=const.δm(m1)2f(m1/d(xx0))\displaystyle={\rm const.}-\frac{\delta m}{(m-1)^{2}}f(m^{1/d}(x-x_{0}))
δm(m1)2f(m1/d(yy0))\displaystyle\quad-\frac{\delta m}{(m-1)^{2}}f(m^{1/d}(y-y_{0}))
+δ(11/m)2f(m1/d(xx0))f(m1/d(yy0)).\displaystyle\quad+\frac{\delta}{(1-1/m)^{2}}f(m^{1/d}(x-x_{0}))f(m^{1/d}(y-y_{0})). (182)

By the assumptions on ff, we see that

mβ/df(m1/d(x0))(0,1)d,β\displaystyle\|m^{-\beta/d}f(m^{1/d}(\cdot-x_{0}))\|_{(0,1)^{d},\beta} L4;\displaystyle\leq\frac{L}{4}; (183)
mβ/df(m1/d(y0))(0,1)d,β\displaystyle\|m^{-\beta/d}f(m^{1/d}(\cdot-y_{0}))\|_{(0,1)^{d},\beta} L4;\displaystyle\leq\frac{L}{4}; (184)
mβ/df(m1/d(x0))f(m1/d(y0))(0,1)2d,β\displaystyle\|m^{-\beta/d}f(m^{1/d}(\cdot-x_{0}))f(m^{1/d}(*-y_{0}))\|_{(0,1)^{2d},\beta} L4.\displaystyle\leq\frac{L}{4}. (185)

Therefore with the choice δ=mβ/d\delta=m^{-\beta/d}, we have pXY(0,1)2d,βL\|p_{XY}\|_{(0,1)^{2d},\beta}\leq L for m10m\geq 10.
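The construction above can be checked numerically; the sketch below (for $d=1$, with a concrete admissible bump $f(t)=\cos^{2}(\pi t/2)$ on $[-1,1]$, and arbitrary illustrative values of $m$ and $\beta$) verifies the uniform marginals (176) and the value formula (186). The Hölder-norm conditions on $f$ would in general require rescaling by a constant depending on $L$, which the sketch ignores.

```python
import numpy as np

# Numerical check of the construction (171)-(176) and of formula (186), for d = 1.
# f is one concrete admissible bump (compact support, integral 1, values in [0,1],
# f(0) = 1 > 0); the Hoelder-norm conditions are ignored in this illustration.
m, beta = 20.0, 1.0
delta = m**(-beta)                        # delta = m^{-beta/d} with d = 1
x0 = y0 = 0.5
f = lambda t: np.where(np.abs(t) <= 1, np.cos(np.pi * t / 2)**2, 0.0)

P = np.array([[(1 + delta)/m**2,              (1/m)*(1 - 1/m) - delta/m**2],
              [(1/m)*(1 - 1/m) - delta/m**2,  (1 - 1/m)**2 + delta/m**2]])   # (171)

x = np.linspace(0, 1, 2001)
pX = np.stack([m * f(m*(x - x0)), (1 - f(m*(x - x0))) / (1 - 1/m)])   # (172), (174)
pY = np.stack([m * f(m*(x - y0)), (1 - f(m*(x - y0))) / (1 - 1/m)])   # (173), (175)

# p_X(x) = sum_i p_{X|Xbar=i}(x) * P_{Xbar}(i) should be identically 1, cf. (176):
print(np.abs(pX.T @ P.sum(axis=1) - 1).max())

# p_XY(x0, y0) computed from the mixture (178) versus the closed form (186):
val = np.einsum('i,j,ij->', pX[:, 1000], pY[:, 1000], P)
print(val, 1 + delta * ((m/(m - 1)) * f(0) - 1/(m - 1))**2)
```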

Now we can apply a Le Cam style argument [51] for the estimation lower bound. Suppose that there exists an algorithm that estimates the density at (x0,y0)(x_{0},y_{0}) as p^\hat{p}. Alice and Bob can convert this to an algorithm for testing the binary distributions PXY¯P_{\underline{XY}} against P¯XY¯\bar{P}_{\underline{XY}}. Indeed, suppose that (𝐗¯,𝐘¯){\bf(\underline{X},\underline{Y})} is an infinite sequence of i.i.d. random variable pairs according to either PXY¯P_{\underline{XY}} or P¯XY¯\bar{P}_{\underline{XY}}. Using the local randomness (which is implied by the common randomness), Alice and Bob can simulate the sequence of i.i.d. random variables (𝐗,𝐘){\bf(X,Y)} according to either PXYP_{XY} or P¯XY\bar{P}_{XY}, by applying the random transformations PX|X¯P_{X|\underline{X}} and PY|Y¯P_{Y|\underline{Y}} coordinate-wise. Then Alice and Bob can apply the density estimation algorithm to obtain p^\hat{p}. Note that p¯XY(x0,y0)=1\bar{p}_{XY}(x_{0},y_{0})=1 while

pXY(x0,y0)=1+δ[mm1f(0)1m1]2,\displaystyle p_{XY}(x_{0},y_{0})=1+\delta\left[\frac{m}{m-1}f(0)-\frac{1}{m-1}\right]^{2}, (186)

the latter following from (181). Now suppose that Bob declares PXY¯P_{\underline{XY}} if

|p^pXY(x0,y0)||p^1|,\displaystyle|\hat{p}-p_{XY}(x_{0},y_{0})|\leq|\hat{p}-1|, (187)

and P¯XY¯\bar{P}_{\underline{XY}} otherwise. By Chebyshev’s inequality, the error probability (under either hypothesis) is upper bounded by

4δ2[mm1f(0)1m1]4suppXY(β,L,A)𝔼[|p^pXY(x0,y0)|2].\displaystyle 4\delta^{-2}\left[\tfrac{m}{m-1}f(0)-\tfrac{1}{m-1}\right]^{-4}\sup_{p_{XY}\in\mathcal{H}(\beta,L,A)}\mathbb{E}[|\hat{p}-p_{XY}(x_{0},y_{0})|^{2}]. (188)
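For completeness, here is the calculation behind (188). Write $\Delta:=p_{XY}(x_{0},y_{0})-1=\delta\left[\frac{m}{m-1}f(0)-\frac{1}{m-1}\right]^{2}>0$ for the separation between the values of the two candidate densities at $(x_{0},y_{0})$ (cf. (186)). Under either hypothesis, the rule (187) errs only if $\hat{p}$ is at distance at least $\Delta/2$ from the true value $v$, so by Markov's inequality applied to $|\hat{p}-v|^{2}$,

\mathbb{P}[\text{error}]\leq\mathbb{P}\left[|\hat{p}-v|\geq\frac{\Delta}{2}\right]\leq\frac{4}{\Delta^{2}}\mathbb{E}[|\hat{p}-v|^{2}],\qquad v\in\{p_{XY}(x_{0},y_{0}),1\},

and bounding the mean square error by its supremum over $\mathcal{H}(\beta,L,A)$ gives (188).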

On the other hand, from (22) and Theorem 5 we have

D(P𝐘¯ΠP¯𝐘¯Π)H(Π)\displaystyle\frac{D(P_{{\bf\underline{Y}}\Pi}\|\bar{P}_{{\bf\underline{Y}}\Pi})}{H(\Pi)} s(X¯;Y¯)\displaystyle\leq s^{*}(\underline{X};\underline{Y}) (189)
δ2mlnmm+1\displaystyle\leq\frac{\delta^{2}}{m\ln m-m+1} (190)
2δ2md4β+2dlnk\displaystyle\leq\frac{2\delta^{2}}{m\cdot\frac{d}{4\beta+2d}\ln k} (191)
8β+4ddak\displaystyle\leq\frac{8\beta+4d}{dak} (192)

when mm is sufficiently large. However, it is known (from Kraft’s inequality, see e.g. [14]) that the expected length of a prefix code upper bounds the entropy. Thus

k𝔼[|Π|]1log2H(Π)\displaystyle k\geq\mathbb{E}[|\Pi|]\geq\frac{1}{\log 2}H(\Pi) (193)

and therefore

D(P𝐘¯ΠP¯𝐘¯Π)8β+4ddalog2.\displaystyle D(P_{{\bf\underline{Y}}\Pi}\|\bar{P}_{{\bf\underline{Y}}\Pi})\leq\frac{8\beta+4d}{da}\log 2. (194)

Then by Pinsker’s inequality (e.g. [44]),

1𝑑P𝐘¯ΠdP¯𝐘¯Π\displaystyle 1-\int dP_{{\bf\underline{Y}}\Pi}\wedge d\bar{P}_{{\bf\underline{Y}}\Pi} 12logeD(P𝐘¯ΠP¯𝐘¯Π)\displaystyle\leq\sqrt{\frac{1}{2\log e}D(P_{{\bf\underline{Y}}\Pi}\|\bar{P}_{{\bf\underline{Y}}\Pi})} (195)
4β+2ddaln2\displaystyle\leq\sqrt{\frac{4\beta+2d}{da}\ln 2} (196)
=12\displaystyle=\frac{1}{2} (197)

where the last line follows from our choice $a=\frac{16\beta+8d}{d}\ln 2$. However, for any test the sum of the two error probabilities is at least $\int dP_{{\bf\underline{Y}}\Pi}\wedge d\bar{P}_{{\bf\underline{Y}}\Pi}$, so the quantity in (188) is at least $\frac{1}{2}\int dP_{{\bf\underline{Y}}\Pi}\wedge d\bar{P}_{{\bf\underline{Y}}\Pi}\geq\frac{1}{4}$. Therefore we have

suppXY(β,L,A)𝔼[|p^pXY(x0,y0)|2]\displaystyle\quad\sup_{p_{XY}\in\mathcal{H}(\beta,L,A)}\mathbb{E}[|\hat{p}-p_{XY}(x_{0},y_{0})|^{2}]
δ28[mm1f(0)1m1]412\displaystyle\geq\frac{\delta^{2}}{8}\left[\frac{m}{m-1}f(0)-\frac{1}{m-1}\right]^{4}\cdot\frac{1}{2} (198)
=116m2β/d[mm1f(0)1m1]4\displaystyle=\frac{1}{16}m^{-2\beta/d}\left[\frac{m}{m-1}f(0)-\frac{1}{m-1}\right]^{4} (199)

which is lower bounded by 117m2β/df4(0)=f4(0)17(aklnk)2β2β+d\frac{1}{17}m^{-2\beta/d}f^{4}(0)=\frac{f^{4}(0)}{17}\left(\frac{ak}{\ln k}\right)^{-\frac{2\beta}{2\beta+d}} for large enough kk. Since aa and f(0)f(0) depend only on d,β,Ld,\beta,L, this establishes the lower bound in Theorem 1.

IX Interactive Density Estimation Lower Bound

In this section we prove the lower bound in Theorem 2.

IX-A Upper Bounding s(X;Y)s^{*}_{\infty}(X;Y)

The heart of the proof is the following technical result:

Theorem 6.

There exists c>0c>0 small enough such that the following holds: For any PXYP_{XY} which is a distribution on {0,1}2\{0,1\}^{2} corresponding to the following matrix:

(p2(1+δ)pp¯p2δpp¯p2δp¯2+p2δ)\displaystyle\left(\begin{array}[]{cc}p^{2}(1+\delta)&p\bar{p}-p^{2}\delta\\ p\bar{p}-p^{2}\delta&\bar{p}^{2}+p^{2}\delta\end{array}\right) (202)

where p,|δ|[0,c)p,|\delta|\in[0,c) and we used the notation p¯:=1p\bar{p}:=1-p, we have

s(X;Y)c1pδ2.\displaystyle s_{\infty}^{*}(X;Y)\leq c^{-1}p\delta^{2}. (203)

The proof can be found in Appendix B.

IX-B Lower Bounding Interactive NP Estimation Risk

The proof is similar to the one-way case (Section VIII-B). Consider the distribution PXY¯P_{\underline{XY}} on {0,1}2\{0,1\}^{2} as in (171). Let m:=(ak)d2β+dm:=(ak)^{\frac{d}{2\beta+d}} and δ:=mβd\delta:=m^{-\frac{\beta}{d}}, where a=2ln2ca=\frac{2\ln 2}{c} with cc being the absolute constant in Theorem 6. Pick the function ff, and define PXY¯XYP_{\underline{XY}XY} and P¯XY¯XY\bar{P}_{\underline{XY}XY} as before. Note that, as before, p¯XY\bar{p}_{XY} is uniform on [0,1]2d[0,1]^{2d}, while pXY(0,1)2d,βL\|p_{XY}\|_{(0,1)^{2d},\beta}\leq L for m10m\geq 10. pXY(x0,y0)p_{XY}(x_{0},y_{0}) has the same formula (186), and Alice and Bob can convert a (now interactive) density estimation algorithm to an algorithm for testing PXY¯P_{\underline{XY}} against P¯XY¯\bar{P}_{\underline{XY}}. With the same testing rule (187), the error probability under either hypothesis is again upper bounded by (188).

Changes arise in (189), where we shall apply Theorem 6 instead. Note that for the absolute constant cc in Theorem 6, the condition 1m,|δ|<c\frac{1}{m},|\delta|<c is satisfied for sufficiently large kk (hence sufficiently large mm).

D(P𝐘¯ΠP¯𝐘¯Π)H(Π)\displaystyle\frac{D(P_{{\bf\underline{Y}}\Pi}\|\bar{P}_{{\bf\underline{Y}}\Pi})}{H(\Pi)} s(X¯;Y¯)\displaystyle\leq s^{*}_{\infty}(\underline{X};\underline{Y}) (204)
c1δ2m\displaystyle\leq\frac{c^{-1}\delta^{2}}{m} (205)
(cak)1.\displaystyle\leq(cak)^{-1}. (206)

Again using Kraft's inequality and $\mathbb{E}[|\Pi|]\leq k$ to bound $H(\Pi)$, we obtain

D(P𝐘¯ΠP¯𝐘¯Π)log2ca.\displaystyle D(P_{{\bf\underline{Y}}\Pi}\|\bar{P}_{{\bf\underline{Y}}\Pi})\leq\frac{\log 2}{ca}. (207)

Then Pinsker’s inequality yields

1𝑑P𝐘¯ΠdP¯𝐘¯Πln22ca=12\displaystyle 1-\int dP_{\underline{\bf Y}\Pi}\wedge d\bar{P}_{\underline{\bf Y}\Pi}\leq\sqrt{\frac{\ln 2}{2ca}}=\frac{1}{2} (208)

since we selected $a=\frac{2\ln 2}{c}$. Again the quantity in (188) is at least $\frac{1}{2}\int dP_{\underline{\bf Y}\Pi}\wedge d\bar{P}_{\underline{\bf Y}\Pi}\geq\frac{1}{4}$, therefore

suppXY(β,L,A)𝔼[|p^pXY(x0,y0)|2]\displaystyle\quad\sup_{p_{XY}\in\mathcal{H}(\beta,L,A)}\mathbb{E}[|\hat{p}-p_{XY}(x_{0},y_{0})|^{2}]
δ28[mm1f(0)1m1]412\displaystyle\geq\frac{\delta^{2}}{8}\left[\frac{m}{m-1}f(0)-\frac{1}{m-1}\right]^{4}\cdot\frac{1}{2} (209)
=116m2β/d[mm1f(0)1m1]4\displaystyle=\frac{1}{16}m^{-2\beta/d}\left[\frac{m}{m-1}f(0)-\frac{1}{m-1}\right]^{4} (210)
f4(0)17(ak)2β2β+d\displaystyle\geq\frac{f^{4}(0)}{17}(ak)^{-\frac{2\beta}{2\beta+d}} (211)

where the last line holds for sufficiently large kk. Since aa is a universal constant and ff depends on d,β,Ld,\beta,L only, this completes the proof of the interactive lower bound.

X Acknowledgement

The author would like to thank Professor Venkat Anantharam for bringing the reference [36] to the author's attention and for some interesting discussions. This research was supported by the starting grant from the Department of Statistics, University of Illinois, Urbana-Champaign.

Appendix A Proof of Theorem 4

First, assume that i{1,2,,r}2i\in\{1,2,\dots,r\}\setminus 2\mathbb{Z}. By the definitions of PUi|XUi1P_{U_{i}|XU^{i-1}} and PUi|YUi1P_{U_{i}|YU^{i-1}}, we can verify that the following holds (for δ=0\delta=0):

PX|Ui(0)(0|𝟎)=PX(0)PX(1)1jioddαj1+PX(0).\displaystyle P_{X|U^{i}}^{(0)}(0|{\bf 0})=\frac{P_{X}(0)}{P_{X}(1)\prod_{1\leq j\leq i}^{\rm odd}\alpha_{j}^{-1}+P_{X}(0)}. (212)

Indeed, (212) follows by applying induction on the following

PX|Ui(0)(0|𝟎)\displaystyle P^{(0)}_{X|U^{i}}(0|{\bf 0}) =p0p0+p1\displaystyle=\frac{p_{0}}{p_{0}+p_{1}} (213)
=PX|Ui2(0)(0|𝟎)PX|Ui2(0)(0|𝟎)+PX|Ui2(0)(1|𝟎)αi1\displaystyle=\frac{P^{(0)}_{X|U^{i-2}}(0|{\bf 0})}{P^{(0)}_{X|U^{i-2}}(0|{\bf 0})+P^{(0)}_{X|U^{i-2}}(1|{\bf 0})\alpha_{i}^{-1}} (214)

where p0:=PX|Ui1(0)(0|𝟎)PUi|XUi1(0)(0|0,𝟎)p_{0}:=P^{(0)}_{X|U^{i-1}}(0|{\bf 0})P^{(0)}_{U_{i}|XU^{i-1}}(0|0,{\bf 0}), p1:=PX|Ui1(0)(1|𝟎)PUi|XUi1(0)(0|1,𝟎)p_{1}:=P^{(0)}_{X|U^{i-1}}(1|{\bf 0})P^{(0)}_{U_{i}|XU^{i-1}}(0|1,{\bf 0}), and we used PX|Ui1(0)(0|𝟎)=PX|Ui2(0)(0|𝟎)P^{(0)}_{X|U^{i-1}}(0|{\bf 0})=P^{(0)}_{X|U^{i-2}}(0|{\bf 0}) which in turn follows from the factorization

PXYUi1|Ui2(0)\displaystyle P^{(0)}_{XYU_{i-1}|U^{i-2}} =PXY|Ui2(0)PUi1|YUi2(0)\displaystyle=P^{(0)}_{XY|U^{i-2}}P^{(0)}_{U_{i-1}|YU^{i-2}} (215)
=PX|Ui2(0)PY|Ui2(0)PUi1|YUi2(0).\displaystyle=P^{(0)}_{X|U^{i-2}}P^{(0)}_{Y|U^{i-2}}P^{(0)}_{U_{i-1}|YU^{i-2}}. (216)
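As a quick sanity check on the induction, the following sketch compares the closed form (212) with the recursion (213)-(214); the value of $P_{X}(0)$ and the odd-round parameters $\alpha_{i}$ below are arbitrary illustrative values.

```python
from math import prod

# Check of the closed form (212) against the recursion (213)-(214).
px0 = 0.01                            # P_X(0) = 1/m_1, an illustrative value
alphas = {1: 3.0, 3: 2.0, 5: 4.0}     # hypothetical alpha_i for the odd rounds i

b = px0                               # P^(0)_{X|U^i}(0 | all-zero messages), i = 0
for i in (1, 3, 5):
    b = b / (b + (1 - b) / alphas[i])                             # recursion (214)
    closed = px0 / ((1 - px0) * prod(1 / alphas[j] for j in (1, 3, 5) if j <= i) + px0)
    print(i, b, closed)               # the two columns agree at every odd round
```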

Now from (212),

PUi|Ui1(0)(0|𝟎)\displaystyle P^{(0)}_{U_{i}|U^{i-1}}(0|{\bf 0}) =PX|Ui1(0)(0|𝟎)+αi1PX|Ui1(0)(1|𝟎)\displaystyle=P^{(0)}_{X|U^{i-1}}(0|{\bf 0})+\alpha_{i}^{-1}P^{(0)}_{X|U^{i-1}}(1|{\bf 0}) (217)
=PX(1)1jioddαj1+PX(0)PX(1)1ji2oddαj1+PX(0).\displaystyle=\frac{P_{X}(1)\prod_{1\leq j\leq i}^{\rm odd}\alpha_{j}^{-1}+P_{X}(0)}{P_{X}(1)\prod_{1\leq j\leq i-2}^{\rm odd}\alpha_{j}^{-1}+P_{X}(0)}. (218)

Similarly, by switching the roles of XX and YY we have

PUi1|Ui2(0)(0|𝟎)\displaystyle P^{(0)}_{U_{i-1}|U^{i-2}}(0|{\bf 0}) =PY(1)2ji1evenαj1+PY(0)PY(1)2ji3evenαj1+PY(0).\displaystyle=\frac{P_{Y}(1)\prod_{2\leq j\leq i-1}^{\rm even}\alpha_{j}^{-1}+P_{Y}(0)}{P_{Y}(1)\prod_{2\leq j\leq i-3}^{\rm even}\alpha_{j}^{-1}+P_{Y}(0)}. (219)

Therefore,

PUi(0)(𝟎)\displaystyle\quad P_{U^{i}}^{(0)}({\bf 0})
\displaystyle=\prod_{1\leq j\leq i}P_{U_{j}|U^{j-1}}^{(0)}(0|{\bf 0}) (220)
=[PX(1)1jioddαj1+PX(0)][PY(1)2jievenαj1+PY(0)]\displaystyle=\left[P_{X}(1)\prod_{1\leq j\leq i}^{\rm odd}\alpha_{j}^{-1}+P_{X}(0)\right]\left[P_{Y}(1)\prod_{2\leq j\leq i}^{\rm even}\alpha_{j}^{-1}+P_{Y}(0)\right] (221)

for any i=1,,ri=1,\dots,r. We also see from (212) and (221) that for any ii odd,

\displaystyle P^{(0)}_{XU^{i-1}}(0,{\bf 0}) =P_{X}(0)\left[P_{Y}(1)\prod_{2\leq j\leq i-1}^{\rm even}\alpha_{j}^{-1}+P_{Y}(0)\right] (222)
=1m1[(11m2)2ji1evenαj1+1m2]\displaystyle=\frac{1}{m_{1}}\left[(1-\frac{1}{m_{2}})\prod_{2\leq j\leq i-1}^{\rm even}\alpha_{j}^{-1}+\frac{1}{m_{2}}\right] (223)
1.1m12ji1evenαj1\displaystyle\leq\frac{1.1}{m_{1}}\prod_{2\leq j\leq i-1}^{\rm even}\alpha_{j}^{-1} (224)

where the last step follows since 2ji1evenαj110m2\prod_{2\leq j\leq i-1}^{\rm even}\alpha_{j}^{-1}\geq\frac{10}{m_{2}}. Therefore, the claim (56) follows, and (57) is similar.

Next, consider any i{1,,r}i\in\{1,\dots,r\}. Define

δi:=PXY|Ui1=𝟎(0,0)PX|Ui1=𝟎(0)PY|Ui1=𝟎(0)1.\displaystyle\delta_{i}:=\frac{P_{XY|U^{i-1}={\bf 0}}(0,0)}{P_{X|U^{i-1}={\bf 0}}(0)P_{Y|U^{i-1}={\bf 0}}(0)}-1. (225)

Observe that the construction of PUr|XYP_{U^{r}|XY} fulfills the Markov chain conditions (14)-(15), implying that

PXY(δ)(0,0)PXY(δ)(1,1)PXY(δ)(0,1)PXY(δ)(1,0)=PXY|Ui1(δ)(0,0|𝟎)PXY|Ui1(δ)(1,1|𝟎)PXY|Ui1(δ)(0,1|𝟎)PXY|Ui1(δ)(1,0|𝟎).\displaystyle\frac{P^{(\delta)}_{XY}(0,0)P^{(\delta)}_{XY}(1,1)}{P^{(\delta)}_{XY}(0,1)P^{(\delta)}_{XY}(1,0)}=\frac{P^{(\delta)}_{XY|U^{i-1}}(0,0|{\bf 0})P^{(\delta)}_{XY|U^{i-1}}(1,1|{\bf 0})}{P^{(\delta)}_{XY|U^{i-1}}(0,1|{\bf 0})P^{(\delta)}_{XY|U^{i-1}}(1,0|{\bf 0})}. (226)

We therefore have111We use the notation x¯:=1x\bar{x}:=1-x for x[0,1]x\in[0,1].

(1+δ)(1+δ(m11)(m21))(1δm11)(1δm21)=(1+δi)(1+δib(δ)c(δ)b¯(δ)c¯(δ))(1δib(δ)b¯(δ))(1δic(δ)c¯(δ))\displaystyle\frac{(1+\delta)(1+\frac{\delta}{(m_{1}-1)(m_{2}-1)})}{(1-\frac{\delta}{m_{1}-1})(1-\frac{\delta}{m_{2}-1})}=\frac{(1+\delta_{i})(1+\delta_{i}\frac{b^{(\delta)}c^{(\delta)}}{\bar{b}^{(\delta)}\bar{c}^{(\delta)}})}{(1-\delta_{i}\frac{b^{(\delta)}}{\bar{b}^{(\delta)}})(1-\delta_{i}\frac{c^{(\delta)}}{\bar{c}^{(\delta)}})} (227)

where we defined

b(δ)\displaystyle b^{(\delta)} :=PX|Ui1(δ)(0|𝟎);\displaystyle:=P_{X|U^{i-1}}^{(\delta)}(0|{\bf 0}); (228)
c(δ)\displaystyle c^{(\delta)} :=PY|Ui1(δ)(0|𝟎).\displaystyle:=P_{Y|U^{i-1}}^{(\delta)}(0|{\bf 0}). (229)

By continuity, we have

b(δ)\displaystyle b^{(\delta)} =b(0)+o(1);\displaystyle=b^{(0)}+o(1); (230)
c(δ)\displaystyle c^{(\delta)} =c(0)+o(1),\displaystyle=c^{(0)}+o(1), (231)

as $\delta\to 0$. It is also easy to see from (227) that $\delta_{i}=O(\delta)$ (for this proof, only $\delta$ is the variable, and all other quantities, such as $m_{1},m_{2}$ and $(\alpha_{i})$, are treated as constants hidden in the Landau notation). Therefore, (227), (230), and (231) yield

1+(1+1m11)(1+1m21)δ+o(δ)\displaystyle\quad 1+\left(1+\frac{1}{m_{1}-1}\right)\left(1+\frac{1}{m_{2}-1}\right)\delta+o(\delta)
=1+(1+b(δ)b¯(δ))(1+c(δ)c¯(δ))δi+o(δ)\displaystyle=1+\left(1+\frac{b^{(\delta)}}{\bar{b}^{(\delta)}}\right)\left(1+\frac{c^{(\delta)}}{\bar{c}^{(\delta)}}\right)\delta_{i}+o(\delta) (232)
=1+(1+b(0)b¯(0))(1+c(0)c¯(0))δi+o(δ).\displaystyle=1+\left(1+\frac{b^{(0)}}{\bar{b}^{(0)}}\right)\left(1+\frac{c^{(0)}}{\bar{c}^{(0)}}\right)\delta_{i}+o(\delta). (233)

Using the fact that XX and YY are independent under P(0)P^{(0)}, noting (212) and the assumption 1jroddαj110m1\prod_{1\leq j\leq r}^{\rm odd}\alpha_{j}^{-1}\geq\frac{10}{m_{1}}, we have

b(0)=PX|Ui(0)(0|𝟎)1m110m1(11m1)+1m1110,\displaystyle b^{(0)}=P_{X|U^{i^{\prime}}}^{(0)}(0|{\bf 0})\leq\frac{\frac{1}{m_{1}}}{\frac{10}{m_{1}}(1-\frac{1}{m_{1}})+\frac{1}{m_{1}}}\leq\frac{1}{10}, (234)

where ii^{\prime} is the largest odd integer not exceeding ii. Similarly we also have c(0)110c^{(0)}\leq\frac{1}{10}. Consequently, (233) yields

δi2\displaystyle\delta_{i}^{2} δ2((1+1m11)(1+1m21)(1+19)2)2+o(δ2)\displaystyle\geq\delta^{2}\left(\frac{(1+\frac{1}{m_{1}-1})(1+\frac{1}{m_{2}-1})}{(1+\frac{1}{9})^{2}}\right)^{2}+o(\delta^{2}) (235)
0.94δ2+o(δ2).\displaystyle\geq 0.9^{4}\delta^{2}+o(\delta^{2}). (236)

Moreover, let us define

a(δ):=PUi|Ui1(δ)(0|𝟎).\displaystyle a^{(\delta)}:=P^{(\delta)}_{U_{i}|U^{i-1}}(0|{\bf 0}). (237)

In the following paragraph we consider arbitrary i{1,2,,r}2i\in\{1,2,\dots,r\}\setminus 2\mathbb{Z}, and we shall omit the superscripts (δ)(\delta) for a(δ)a^{(\delta)}, b(δ)b^{(\delta)}, c(δ)c^{(\delta)}, unless otherwise noted. Then

I(Ui;Y|Ui1=𝟎)\displaystyle\quad I(U_{i};Y|U^{i-1}={\bf 0})
=aD(PY|Ui=𝟎PY|Ui1=𝟎)\displaystyle=aD(P_{Y|U^{i}={\bf 0}}\|P_{Y|U^{i-1}={\bf 0}})
+a¯D(PY|Ui=1,Ui1=𝟎PY|Ui1=𝟎)\displaystyle\quad+\bar{a}D(P_{Y|U_{i}=1,U^{i-1}={\bf 0}}\|P_{Y|U^{i-1}={\bf 0}}) (238)

We can verify that PX|Ui=𝟎(0)=bb+αi1b¯P_{X|U^{i}={\bf 0}}(0)=\frac{b}{b+\alpha_{i}^{-1}\bar{b}}. Therefore,

PY|Ui=𝟎(0)\displaystyle P_{Y|U^{i}={\bf 0}}(0) =bb+αi1b¯c(1+δi1)\displaystyle=\frac{b}{b+\alpha_{i}^{-1}\bar{b}}\cdot c(1+\delta_{i-1})
+αi1b¯b+αi1b¯b¯cδi1bcb¯\displaystyle\quad+\frac{\alpha_{i}^{-1}\bar{b}}{b+\alpha_{i}^{-1}\bar{b}}\cdot\frac{\bar{b}c-\delta_{i-1}bc}{\bar{b}} (239)
=c+bc(1αi1)b+αi1b¯δi1.\displaystyle=c+\frac{bc(1-\alpha_{i}^{-1})}{b+\alpha_{i}^{-1}\bar{b}}\delta_{i-1}. (240)

Therefore as δ0\delta\to 0,

D(PY|Ui=𝟎PY|Ui1=𝟎)\displaystyle D(P_{Y|U^{i}={\bf 0}}\|P_{Y|U^{i-1}={\bf 0}}) =d(PY|Ui=𝟎(0)c)\displaystyle=d\left(P_{Y|U^{i}={\bf 0}}(0)\|c\right) (241)
=12c(b(αi1)αib+b¯δi1)2+o(δ2)\displaystyle=\frac{1}{2}c\left(\frac{b(\alpha_{i}-1)}{\alpha_{i}b+\bar{b}}\delta_{i-1}\right)^{2}+o(\delta^{2}) (242)
12c(b(αi1)0.92δ)2+o(δ2)\displaystyle\geq\frac{1}{2}c\left(b(\alpha_{i}-1)0.9^{2}\delta\right)^{2}+o(\delta^{2}) (243)
0.942cb2(αi1)δ2+o(δ2)\displaystyle\geq\frac{0.9^{4}}{2}cb^{2}(\alpha_{i}-1)\delta^{2}+o(\delta^{2}) (244)

where d(pq):=plogpq+(1p)log1p1qd(p\|q):=p\log\frac{p}{q}+(1-p)\log\frac{1-p}{1-q} denotes the binary divergence function, and recall that we assumed the natural base of logarithms. On the other hand, PY|Ui=1,Ui1=𝟎(0)=PY|X=1,Ui1=𝟎(0)=cδi1bc1bP_{Y|U_{i}=1,U^{i-1}={\bf 0}}(0)=P_{Y|X=1,U^{i-1}={\bf 0}}(0)=c-\frac{\delta_{i-1}bc}{1-b}. Therefore

D(PY|Ui=1,Ui1=𝟎PY|Ui1=𝟎)\displaystyle D(P_{Y|U_{i}=1,U^{i-1}={\bf 0}}\|P_{Y|U^{i-1}={\bf 0}}) =d(cδi1bc1bc)\displaystyle=d\left(c-\frac{\delta_{i-1}bc}{1-b}\|c\right) (245)
=12c(δi1b1b)2+o(δ2)\displaystyle=\frac{1}{2}c\left(\frac{\delta_{i-1}b}{1-b}\right)^{2}+o(\delta^{2}) (246)
0.942cb2δ2+o(δ2).\displaystyle\geq\frac{0.9^{4}}{2}cb^{2}\delta^{2}+o(\delta^{2}). (247)

Turning back to (238), we obtain

I(Ui;Y|Ui1=𝟎)\displaystyle\quad I(U_{i};Y|U^{i-1}={\bf 0})
=0.942[ab2c(αi1)2δ2+(1a)cb2δ2]+o(δ2)\displaystyle=\frac{0.9^{4}}{2}\left[ab^{2}c(\alpha_{i}-1)^{2}\delta^{2}+(1-a)cb^{2}\delta^{2}\right]+o(\delta^{2}) (248)
0.952(αi1(αi1)2+1αi1)b2cδ2+o(δ2)\displaystyle\geq\frac{0.9^{5}}{2}\left(\alpha_{i}^{-1}(\alpha_{i}-1)^{2}+1-\alpha_{i}^{-1}\right)b^{2}c\delta^{2}+o(\delta^{2}) (249)
0.952(αi1)b2cδ2+o(δ2)\displaystyle\geq\frac{0.9^{5}}{2}(\alpha_{i}-1)b^{2}c\delta^{2}+o(\delta^{2}) (250)

where (249) follows since (218) implies

a(δ)\displaystyle a^{(\delta)} =a(0)+o(1)\displaystyle=a^{(0)}+o(1) (251)
=(11m1)1jioddαj1+1m1(11m1)1ji2oddαj1+1m1+o(1)\displaystyle=\frac{(1-\frac{1}{m_{1}})\prod_{1\leq j\leq i}^{\rm odd}\alpha_{j}^{-1}+\frac{1}{m_{1}}}{(1-\frac{1}{m_{1}})\prod_{1\leq j\leq i-2}^{\rm odd}\alpha_{j}^{-1}+\frac{1}{m_{1}}}+o(1) (252)
(11m1)1jioddαj1(11m1)1ji2oddαj1+o(1)\displaystyle\geq\frac{(1-\frac{1}{m_{1}})\prod_{1\leq j\leq i}^{\rm odd}\alpha_{j}^{-1}}{(1-\frac{1}{m_{1}})\prod_{1\leq j\leq i-2}^{\rm odd}\alpha_{j}^{-1}}+o(1) (253)
=αi1+o(1)\displaystyle=\alpha_{i}^{-1}+o(1) (254)

and

1a(δ)\displaystyle 1-a^{(\delta)} =1a(0)+o(1)\displaystyle=1-a^{(0)}+o(1) (255)
=(11m1)(1αi1)1ji2oddαj1(11m1)1ji2oddαj1+1m1+o(1)\displaystyle=\frac{(1-\frac{1}{m_{1}})(1-\alpha_{i}^{-1})\prod_{1\leq j\leq i-2}^{\rm odd}\alpha_{j}^{-1}}{(1-\frac{1}{m_{1}})\prod_{1\leq j\leq i-2}^{\rm odd}\alpha_{j}^{-1}+\frac{1}{m_{1}}}+o(1) (256)
(11m1)(1αi1)10m1(11m1)10m1+1m1+o(1)\displaystyle\geq\frac{(1-\frac{1}{m_{1}})(1-\alpha_{i}^{-1})\frac{10}{m_{1}}}{(1-\frac{1}{m_{1}})\frac{10}{m_{1}}+\frac{1}{m_{1}}}+o(1) (257)
0.9(1αi1)+o(1).\displaystyle\geq 0.9(1-\alpha_{i}^{-1})+o(1). (258)

Moreover, by (212),

b(δ)\displaystyle b^{(\delta)} =b(0)+o(1)\displaystyle=b^{(0)}+o(1) (259)
=1m1(11m1)1ji1oddαj1+1m1+o(1)\displaystyle=\frac{\frac{1}{m_{1}}}{(1-\frac{1}{m_{1}})\prod_{1\leq j\leq i-1}^{\rm odd}\alpha_{j}^{-1}+\frac{1}{m_{1}}}+o(1) (260)
1m1(11m1+110)1ji1oddαj1+o(1)\displaystyle\geq\frac{\frac{1}{m_{1}}}{(1-\frac{1}{m_{1}}+\frac{1}{10})\prod_{1\leq j\leq i-1}^{\rm odd}\alpha_{j}^{-1}}+o(1) (261)
0.9m11ji1oddαj+o(1).\displaystyle\geq\frac{0.9}{m_{1}}\prod_{1\leq j\leq i-1}^{\rm odd}\alpha_{j}+o(1). (262)

Similarly,

c(δ)0.9m21ji1evenαj+o(1).\displaystyle c^{(\delta)}\geq\frac{0.9}{m_{2}}\prod_{1\leq j\leq i-1}^{\rm even}\alpha_{j}+o(1). (263)

Therefore by (250) and (221),

I(Ui;Y|Ui1)\displaystyle I(U_{i};Y|U^{i-1}) =I(Ui;Y|Ui1=𝟎)PUi1(𝟎)\displaystyle=I(U_{i};Y|U^{i-1}={\bf 0})P_{U^{i-1}}({\bf 0}) (264)
0.952(αi1)b2cδ21ji1αj1+o(δ2)\displaystyle\geq\frac{0.9^{5}}{2}(\alpha_{i}-1)b^{2}c\delta^{2}\prod_{1\leq j\leq i-1}\alpha_{j}^{-1}+o(\delta^{2}) (265)
0.98δ22m12m2(αi1)1ji1oddαj+o(δ2)\displaystyle\geq\frac{0.9^{8}\delta^{2}}{2m_{1}^{2}m_{2}}(\alpha_{i}-1)\prod_{1\leq j\leq i-1}^{\rm odd}\alpha_{j}+o(\delta^{2}) (266)
=0.98δ22m12m2(1jioddαj1ji2oddαj)+o(δ2)\displaystyle=\frac{0.9^{8}\delta^{2}}{2m_{1}^{2}m_{2}}\left(\prod_{1\leq j\leq i}^{\rm odd}\alpha_{j}-\prod_{1\leq j\leq i-2}^{\rm odd}\alpha_{j}\right)+o(\delta^{2}) (267)

and hence

1iroddI(Ui;Y|Ui1)\displaystyle\sum_{1\leq i\leq r}^{\rm odd}I(U_{i};Y|U^{i-1}) 0.98δ22m12m21jroddαj+o(δ2),\displaystyle\geq\frac{0.9^{8}\delta^{2}}{2m_{1}^{2}m_{2}}\prod_{1\leq j\leq r}^{\rm odd}\alpha_{j}+o(\delta^{2}), (268)

establishing the claim (58) of the theorem. The proof of (59) is similar.

Appendix B Proof of Theorem 6

We can choose the natural base of logarithms in this proof. Choose 𝐔=(U1,U2,,Ur){\bf U}=(U_{1},U_{2},\dots,U_{r}) satisfying the Markov chain conditions (14)-(15) and so that

s(X;Y)\displaystyle s_{\infty}^{*}(X;Y) 2I(X;Y)I(X;Y|𝐔)I(𝐔;X,Y)\displaystyle\leq 2\cdot\frac{I(X;Y)-I(X;Y|\bf U)}{I({\bf U};X,Y)} (269)
4I(X;Y)I(X;Y|𝐔)I(𝐔;X)+I(𝐔;Y)\displaystyle\leq 4\cdot\frac{I(X;Y)-I(X;Y|\bf U)}{I({\bf U};X)+I({\bf U};Y)} (270)

which is possible by the definition of s(X;Y)s_{\infty}^{*}(X;Y).

Given α,β[0,1]\alpha,\beta\in[0,1], define by Pα,βP^{\alpha,\beta} the unique distribution222Alternatively, Pα,βP^{\alpha,\beta} equals the II-projection argminQXYD(QXYPXY)\operatorname*{arg\,min}_{Q_{XY}}D(Q_{XY}\|P_{XY}) under the constraints QX=[α,α¯]Q_{X}=[\alpha,\bar{\alpha}] and QY=[β,β¯]Q_{Y}=[\beta,\bar{\beta}] [15, Corollary 3.3]. such that

Pα,β(x,y)=PXY(x,y)f(x)g(y)\displaystyle P^{\alpha,\beta}(x,y)=P_{XY}(x,y)f(x)g(y) (271)

for some functions ff and gg, and such that the marginals are Pα:=[α,α¯]P^{\alpha}:=[\alpha,\bar{\alpha}] and Pβ:=[β,β¯]P^{\beta}:=[\beta,\bar{\beta}]. For the existence of Pα,βP^{\alpha,\beta}, see e.g. [23, 40]. Define I(α,β)I(\alpha,\beta) as the mutual information of (X,Y)(X,Y) under Pα,βP^{\alpha,\beta}. Define λ=λ(α,β)\lambda=\lambda(\alpha,\beta) as the number such that Pα,βP^{\alpha,\beta} is the matrix

\displaystyle\left(\begin{array}[]{cc}\alpha\beta+\lambda&\alpha\bar{\beta}-\lambda\\ \bar{\alpha}\beta-\lambda&\bar{\alpha}\bar{\beta}+\lambda\end{array}\right). (274)

Given any 𝐮\bf u, let α𝐮[0,1]\alpha_{\bf u}\in[0,1] be such that PX|𝐔=𝐮=[α𝐮,α¯𝐮]P_{X|{\bf U=u}}=[\alpha_{\bf u},\bar{\alpha}_{\bf u}]. Define β𝐮\beta_{\bf u} similarly but for PY|𝐔=𝐮P_{Y|{\bf U=u}}. With these notations, note that

𝔼[α𝐔]=𝔼[β𝐔]=p;\displaystyle\mathbb{E}[\alpha_{\bf U}]=\mathbb{E}[\beta_{\bf U}]=p; (275)

and

I(X;Y)I(X;Y|𝐔)=I(p,p)𝔼[I(α𝐔,β𝐔)];\displaystyle I(X;Y)-I(X;Y|{\bf U})=I(p,p)-\mathbb{E}[I(\alpha_{\bf U},\beta_{\bf U})]; (276)
I(𝐔;X)+I(𝐔;Y)=𝔼[d(α𝐔p)+d(β𝐔p)]\displaystyle I({\bf U};X)+I({\bf U};Y)=\mathbb{E}[d(\alpha_{\bf U}\|p)+d(\beta_{\bf U}\|p)] (277)

where we recall that d()d(\cdot\|\cdot) denotes the binary divergence function. Define

ψ(α,β):=d(αp)+d(βp).\displaystyle\psi(\alpha,\beta):=d(\alpha\|p)+d(\beta\|p). (278)

Then note that ψ(α,β)\psi(\alpha,\beta) is a smooth nonnegative function on [0,1]2[0,1]^{2} with vanishing value and first derivatives at (p,p)(p,p). Also, define

ϕ(α,β):=\displaystyle\quad\phi(\alpha,\beta):=
I(p,p)I(α,β)+Iα(p,p)(αp)+Iβ(p,p)(βp)\displaystyle I(p,p)-I(\alpha,\beta)+I_{\alpha}(p,p)(\alpha-p)+I_{\beta}(p,p)(\beta-p) (279)

where Iα(p,p):=αI(α,β)|(p,p)I_{\alpha}(p,p):=\left.\frac{\partial}{\partial\alpha}I(\alpha,\beta)\right|_{(p,p)}. Then ϕ\phi is also a smooth function on [0,1]2[0,1]^{2} with vanishing value and first derivatives at (p,p)(p,p). Moreover, due to (275), we have

𝔼[ϕ(α𝐔,β𝐔)]=I(p,p)𝔼[I(α𝐔,β𝐔)].\displaystyle\mathbb{E}[\phi(\alpha_{\bf U},\beta_{\bf U})]=I(p,p)-\mathbb{E}[I(\alpha_{\bf U},\beta_{\bf U})]. (280)

Thus to prove the theorem it suffices to show the existence of sufficiently small c>0c>0, such that for any p,|δ|(0,c)p,|\delta|\in(0,c), there is

supα,βϕ(α,β)ψ(α,β)c1pδ2\displaystyle\sup_{\alpha,\beta}\frac{\phi(\alpha,\beta)}{\psi(\alpha,\beta)}\leq c^{-1}p\delta^{2} (281)

where the sup is over α,β(0,1)\alpha,\beta\in(0,1).

  • Case 1: 0.1p<α,β<10p0.1p<\alpha,\beta<10p.
    Since 2α2D(αp)=[1α+11α]1α110p\frac{\partial^{2}}{\partial\alpha^{2}}D(\alpha\|p)=\left[\frac{1}{\alpha}+\frac{1}{1-\alpha}\right]\geq\frac{1}{\alpha}\geq\frac{1}{10p} for α[0,10p]\alpha\in[0,10p], we have

    ψ(α,β)120p[(αp)2+(βp)2]\displaystyle\psi(\alpha,\beta)\geq\frac{1}{20p}[(\alpha-p)^{2}+(\beta-p)^{2}] (282)

    for (α,β)[0,10p]2(\alpha,\beta)\in[0,10p]^{2}. Now if we can show that

    sup(α,β)[0,10p]22ϕ(α,β)\displaystyle\sup_{(\alpha,\beta)\in[0,10p]^{2}}\|\partial^{2}\phi(\alpha,\beta)\| =sup(α,β)[0,10p]22I(α,β)\displaystyle=\sup_{(\alpha,\beta)\in[0,10p]^{2}}\|\partial^{2}I(\alpha,\beta)\| (283)
    δ2,\displaystyle\lesssim\delta^{2}, (284)

    we will obtain sup(α,β)[0,10p]2ϕ(α,β)ψ(α,β)pδ2\sup_{(\alpha,\beta)\in[0,10p]^{2}}\frac{\phi(\alpha,\beta)}{\psi(\alpha,\beta)}\lesssim p\delta^{2} which matches (281). Here and below, xyx\lesssim y means that there is an absolute constant C>0C>0 such that xCyx\leq Cy when cc in the theorem statement (and hence pp and |δ||\delta|) is sufficiently small.

    Before explicitly computing $\partial^{2}\phi(\alpha,\beta)$, we give some intuition for why we should expect (284) to be true. For fixed $\alpha,\beta,p$, we will show that

    I(α,β)=I~(α,β)+o(δ2)\displaystyle I(\alpha,\beta)=\tilde{I}(\alpha,\beta)+o(\delta^{2}) (285)

    as δ0\delta\to 0, where we defined

    I~(α,β):=δ22p¯4αα¯ββ¯.\displaystyle\tilde{I}(\alpha,\beta):=\frac{\delta^{2}}{2\bar{p}^{4}}\alpha\bar{\alpha}\beta\bar{\beta}. (286)

    If the difference between I(α,β)I(\alpha,\beta) and I~(α,β)\tilde{I}(\alpha,\beta) could be neglected, then (284) should hold. To see (285), for given α,β(0,1)\alpha,\beta\in(0,1), note that (271) implies,

    (1+λαβ)(1+λα¯β¯)(1λαβ¯)(1λα¯β)=(1+δ)(1+δp2p¯2)(1δpp¯)2.\displaystyle\frac{(1+\frac{\lambda}{\alpha\beta})(1+\frac{\lambda}{\bar{\alpha}\bar{\beta}})}{(1-\frac{\lambda}{\alpha\bar{\beta}})(1-\frac{\lambda}{\bar{\alpha}\beta})}=\frac{(1+\delta)(1+\frac{\delta p^{2}}{\bar{p}^{2}})}{(1-\frac{\delta p}{\bar{p}})^{2}}. (287)

    Under the assumption δ0\delta\to 0, the above linearizes to

    λαα¯ββ¯=δp¯2+o(δ).\displaystyle\frac{\lambda}{\alpha\bar{\alpha}\beta\bar{\beta}}=\frac{\delta}{\bar{p}^{2}}+o(\delta). (288)
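    Explicitly, since $\frac{1}{\alpha\beta}+\frac{1}{\alpha\bar{\beta}}+\frac{1}{\bar{\alpha}\beta}+\frac{1}{\bar{\alpha}\bar{\beta}}=\frac{1}{\alpha\bar{\alpha}\beta\bar{\beta}}$, the left side of (287) equals $1+\frac{\lambda}{\alpha\bar{\alpha}\beta\bar{\beta}}+O(\lambda^{2})$, while the right side equals $1+\delta\left(1+\frac{p}{\bar{p}}\right)^{2}+O(\delta^{2})=1+\frac{\delta}{\bar{p}^{2}}+O(\delta^{2})$; matching the first-order terms gives (288).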

    Moreover, note that

    Dχ2(Pα,βPα×Pβ)\displaystyle D_{\chi^{2}}(P^{\alpha,\beta}\|P^{\alpha}\times P^{\beta}) =λ22αα¯ββ¯\displaystyle=\frac{\lambda^{2}}{2\alpha\bar{\alpha}\beta\bar{\beta}} (289)
    =δ22p¯4αα¯ββ¯+o(δ2)\displaystyle=\frac{\delta^{2}}{2\bar{p}^{4}}\alpha\bar{\alpha}\beta\bar{\beta}+o(\delta^{2}) (290)

    where the last step follows by comparing with (288). Since $I(\alpha,\beta)/D_{\chi^{2}}(P^{\alpha,\beta}\|P^{\alpha}\times P^{\beta})\to 1$ as $\delta\to 0$, we see that (285) holds. Of course, (285) does not by itself establish (284), since approximation of function values does not in general imply approximation of derivatives. However, we shall next compute the derivatives explicitly to give a rigorous proof, and the above observations serve as useful guides.

    First, note that

    I(α,β)α\displaystyle\frac{\partial I(\alpha,\beta)}{\partial\alpha}
    =x,y{0,1}(αPα,β(x,y))lnPα,β(x,y)Pα(x)Pβ(x)\displaystyle=\sum_{x,y\in\{0,1\}}\left(\frac{\partial}{\partial\alpha}P^{\alpha,\beta}(x,y)\right)\ln\frac{P^{\alpha,\beta}(x,y)}{P^{\alpha}(x)P^{\beta}(x)} (291)
    =(β+λα)ln(1+λαβ)+(β¯λα)ln(1λαβ¯)\displaystyle=(\beta+\lambda_{\alpha})\ln(1+\frac{\lambda}{\alpha\beta})+(\bar{\beta}-\lambda_{\alpha})\ln(1-\frac{\lambda}{\alpha\bar{\beta}})
    (βλα)ln(1λα¯β)+(β¯+λα)ln(1+λα¯β¯).\displaystyle\quad(-\beta-\lambda_{\alpha})\ln(1-\frac{\lambda}{\bar{\alpha}\beta})+(-\bar{\beta}+\lambda_{\alpha})\ln(1+\frac{\lambda}{\bar{\alpha}\bar{\beta}}). (292)

    where λα:=αλ\lambda_{\alpha}:=\frac{\partial}{\partial\alpha}\lambda. Next, we express the first and second derivatives, λα\lambda_{\alpha}, λβ\lambda_{\beta} and λα,β\lambda_{\alpha,\beta}, in terms of λ\lambda. Differentiating the logarithm of (287) in β\beta yields

    λβ[1αβ+λ+1α¯β¯+λ+1αβ¯λ+1α¯βλ]\displaystyle\quad\lambda_{\beta}\left[\frac{1}{\alpha\beta+\lambda}+\frac{1}{\bar{\alpha}\bar{\beta}+\lambda}+\frac{1}{\alpha\bar{\beta}-\lambda}+\frac{1}{\bar{\alpha}\beta-\lambda}\right]
    =λ[β1αβ+λβ¯1α¯β¯+λβ¯1αβ¯λ+β1α¯βλ].\displaystyle=\lambda\left[\frac{\beta^{-1}}{\alpha\beta+\lambda}-\frac{\bar{\beta}^{-1}}{\bar{\alpha}\bar{\beta}+\lambda}-\frac{\bar{\beta}^{-1}}{\alpha\bar{\beta}-\lambda}+\frac{\beta^{-1}}{\bar{\alpha}\beta-\lambda}\right]. (293)

    In the rest of the proof the notation f(t)=O(t)f(t)=O(t) means |f(t)||t||f(t)|\lesssim|t| (recall the definition of \lesssim in (284)), and f(t)=Θ(t)f(t)=\Theta(t) if 1f(t)/t11\lesssim f(t)/t\lesssim 1. Note that for 0.1p<α,β<10p0.1p<\alpha,\beta<10p we have

    λαβ=Θ(δ),\displaystyle\frac{\lambda}{\alpha\beta}=\Theta(\delta), (294)

    since the right side of (287) clearly equals 1+Θ(δ)1+\Theta(\delta). Then by (293),

    λβ[1αα¯ββ¯+O(λα2β2)]=λ[β¯βαα¯β2β¯2+O(λα2β3)]\displaystyle\lambda_{\beta}\left[\frac{1}{\alpha\bar{\alpha}\beta\bar{\beta}}+O(\frac{\lambda}{\alpha^{2}\beta^{2}})\right]=\lambda\left[\frac{\bar{\beta}-\beta}{\alpha\bar{\alpha}\beta^{2}\bar{\beta}^{2}}+O(\frac{\lambda}{\alpha^{2}\beta^{3}})\right] (295)

    and hence,

    λβ=λβ¯βββ¯(1+O(λαβ))=O(λβ).\displaystyle\lambda_{\beta}=\lambda\cdot\frac{\bar{\beta}-\beta}{\beta\bar{\beta}}\left(1+O(\frac{\lambda}{\alpha\beta})\right)=O(\frac{\lambda}{\beta}). (296)

    Expression of λα\lambda_{\alpha} can be found similarly. Moreover, differentiating (293) we get

    λα,β[1αβ+λ+1α¯β¯+λ+1αβ¯λ+1α¯βλ]\displaystyle\lambda_{\alpha,\beta}\left[\tfrac{1}{\alpha\beta+\lambda}+\tfrac{1}{\bar{\alpha}\bar{\beta}+\lambda}+\tfrac{1}{\alpha\bar{\beta}-\lambda}+\tfrac{1}{\bar{\alpha}\beta-\lambda}\right]
    +λαλβ[1(αβ+λ)21(α¯β¯+λ)2+1(αβ¯λ)2+1(α¯βλ)2]\displaystyle+\lambda_{\alpha}\lambda_{\beta}\left[-\tfrac{1}{(\alpha\beta+\lambda)^{2}}-\tfrac{1}{(\bar{\alpha}\bar{\beta}+\lambda)^{2}}+\tfrac{1}{(\alpha\bar{\beta}-\lambda)^{2}}+\tfrac{1}{(\bar{\alpha}\beta-\lambda)^{2}}\right]
    +λα[α(αβ+λ)2+α¯(α¯β¯+λ)2+α(αβ¯λ)2α¯(α¯βλ)2]\displaystyle+\lambda_{\alpha}\left[-\tfrac{\alpha}{(\alpha\beta+\lambda)^{2}}+\tfrac{\bar{\alpha}}{(\bar{\alpha}\bar{\beta}+\lambda)^{2}}+\tfrac{\alpha}{(\alpha\bar{\beta}-\lambda)^{2}}-\tfrac{\bar{\alpha}}{(\bar{\alpha}\beta-\lambda)^{2}}\right]
    +λβ[β(αβ+λ)2+β¯(α¯β¯+λ)2+β(α¯βλ)2β¯(αβ¯λ)2]\displaystyle+\lambda_{\beta}\left[-\tfrac{\beta}{(\alpha\beta+\lambda)^{2}}+\tfrac{\bar{\beta}}{(\bar{\alpha}\bar{\beta}+\lambda)^{2}}+\tfrac{\beta}{(\bar{\alpha}\beta-\lambda)^{2}}-\tfrac{\bar{\beta}}{(\alpha\bar{\beta}-\lambda)^{2}}\right]
    +λ[1(αβ+λ)2+1(α¯β¯+λ)21(αβ¯λ)21(α¯βλ)2]=0,\displaystyle+\lambda\left[\tfrac{1}{(\alpha\beta+\lambda)^{2}}+\tfrac{1}{(\bar{\alpha}\bar{\beta}+\lambda)^{2}}-\tfrac{1}{(\alpha\bar{\beta}-\lambda)^{2}}-\tfrac{1}{(\bar{\alpha}\beta-\lambda)^{2}}\right]=0, (297)

    from which we can deduce that

    λα,β=O(λαβ).\displaystyle\lambda_{\alpha,\beta}=O(\frac{\lambda}{\alpha\beta}). (298)

    Now, taking the derivative in (292), we obtain

    α,βI(α,β)\displaystyle\quad\partial_{\alpha,\beta}I(\alpha,\beta)
    =λα,βλ1αα¯ββ¯+λαλβ1αα¯ββ¯+λαλββ¯αα¯β2β¯2\displaystyle=\lambda_{\alpha,\beta}\lambda\cdot\frac{1}{\alpha\bar{\alpha}\beta\bar{\beta}}+\lambda_{\alpha}\lambda_{\beta}\cdot\frac{1}{\alpha\bar{\alpha}\beta\bar{\beta}}+\lambda_{\alpha}\lambda\cdot\frac{\beta-\bar{\beta}}{\alpha\bar{\alpha}\beta^{2}\bar{\beta}^{2}}
    +λβλαα¯α2α¯2ββ¯+λ22(α¯α)(β¯β)α2α¯2β2β¯2\displaystyle\quad+\lambda_{\beta}\lambda\cdot\frac{\alpha-\bar{\alpha}}{\alpha^{2}\bar{\alpha}^{2}\beta\bar{\beta}}+\frac{\lambda^{2}}{2}\cdot\frac{(\bar{\alpha}-\alpha)(\bar{\beta}-\beta)}{\alpha^{2}\bar{\alpha}^{2}\beta^{2}\bar{\beta}^{2}}
    +O(λ3α3β3).\displaystyle\quad+O\left(\frac{\lambda^{3}}{\alpha^{3}\beta^{3}}\right). (299)

    In deriving (299), we applied the Taylor expansions of xln(1+x)x\mapsto\ln(1+x) and x11+xx\mapsto\frac{1}{1+x}. Plugging (296) and (298) into (299), we obtain

    |α,βI(α,β)|=O((λαβ)2)=O(δ2).\displaystyle|\partial_{\alpha,\beta}I(\alpha,\beta)|=O(\left(\frac{\lambda}{\alpha\beta}\right)^{2})=O(\delta^{2}). (300)

    Next, we control |α,αI(α,β)||\partial_{\alpha,\alpha}I(\alpha,\beta)|. Similarly to (293), we have

    λα[1αβ+λ+1α¯β¯+λ+1α¯βλ+1αβ¯λ]\displaystyle\quad\lambda_{\alpha}\left[\frac{1}{\alpha\beta+\lambda}+\frac{1}{\bar{\alpha}\bar{\beta}+\lambda}+\frac{1}{\bar{\alpha}\beta-\lambda}+\frac{1}{\alpha\bar{\beta}-\lambda}\right]
    =λ[α1αβ+λα¯1α¯β¯+λα¯1α¯βλ+α1αβ¯λ].\displaystyle=\lambda\left[\frac{\alpha^{-1}}{\alpha\beta+\lambda}-\frac{\bar{\alpha}^{-1}}{\bar{\alpha}\bar{\beta}+\lambda}-\frac{\bar{\alpha}^{-1}}{\bar{\alpha}\beta-\lambda}+\frac{\alpha^{-1}}{\alpha\bar{\beta}-\lambda}\right]. (301)

    Further taking the derivative, we obtain

    λα,α[1αβ+λ+1α¯β¯+λ+1α¯βλ+1αβ¯λ]\displaystyle\quad\lambda_{\alpha,\alpha}\left[\frac{1}{\alpha\beta+\lambda}+\frac{1}{\bar{\alpha}\bar{\beta}+\lambda}+\frac{1}{\bar{\alpha}\beta-\lambda}+\frac{1}{\alpha\bar{\beta}-\lambda}\right]
    +λα[2βα1λ(αβ+λ)2+2β¯+α¯1λ(α¯β¯+λ)2\displaystyle+\lambda_{\alpha}\left[\frac{-2\beta-\alpha^{-1}\lambda}{(\alpha\beta+\lambda)^{2}}+\frac{2\bar{\beta}+\bar{\alpha}^{-1}\lambda}{(\bar{\alpha}\bar{\beta}+\lambda)^{2}}\right.
    +2βα¯1λ(α¯βλ)2+2β¯+α1λ(αβ¯λ)2]\displaystyle\quad\quad+\left.\frac{2\beta-\bar{\alpha}^{-1}\lambda}{(\bar{\alpha}\beta-\lambda)^{2}}+\frac{-2\bar{\beta}+\alpha^{-1}\lambda}{(\alpha\bar{\beta}-\lambda)^{2}}\right]
    +λ2[1α2(αβ+λ)2+1α¯2(α¯β¯+λ)2\displaystyle+\lambda^{2}\left[\frac{1}{\alpha^{2}(\alpha\beta+\lambda)^{2}}+\frac{1}{\bar{\alpha}^{2}(\bar{\alpha}\bar{\beta}+\lambda)^{2}}\right.
    1α¯2(α¯βλ)21α2(αβ¯λ)2]\displaystyle\quad\quad\left.-\frac{1}{\bar{\alpha}^{2}(\bar{\alpha}\beta-\lambda)^{2}}-\frac{1}{\alpha^{2}(\alpha\bar{\beta}-\lambda)^{2}}\right]
    +2λ[βα(αβ+λ)2+β¯α¯(α¯β¯+λ)2\displaystyle+2\lambda\left[\frac{\beta}{\alpha(\alpha\beta+\lambda)^{2}}+\frac{\bar{\beta}}{\bar{\alpha}(\bar{\alpha}\bar{\beta}+\lambda)^{2}}\right.
    +βα¯(α¯βλ)2+β¯α(αβ¯λ)2]=0.\displaystyle\quad\quad\left.+\frac{\beta}{\bar{\alpha}(\bar{\alpha}\beta-\lambda)^{2}}+\frac{\bar{\beta}}{\alpha(\alpha\bar{\beta}-\lambda)^{2}}\right]=0. (302)

    Next, we use the assumption that α,β(0.1p,10p)\alpha,\beta\in(0.1p,10p) to simplify (302) as

    λα,αΘ(1p2)λαΘ(1p3)+λ2Θ(1p6)+λΘ(1p4)=0.\displaystyle\lambda_{\alpha,\alpha}\cdot\Theta(\frac{1}{p^{2}})-\lambda_{\alpha}\cdot\Theta(\frac{1}{p^{3}})+\lambda^{2}\cdot\Theta(\frac{1}{p^{6}})+\lambda\cdot\Theta(\frac{1}{p^{4}})=0. (303)

    Since, analogously to (296), we have

    λα=O(λα),\displaystyle\lambda_{\alpha}=O(\frac{\lambda}{\alpha}), (304)

    we see that (303) implies (see the verification given after Case 3)

    λα,α=O(λp2).\displaystyle\lambda_{\alpha,\alpha}=O(\frac{\lambda}{p^{2}}). (305)

    Tighter estimates of λα,α\lambda_{\alpha,\alpha} are possible, but the above will suffice. We now take the derivative of (292) in α\alpha:

    α,αI(α,β)=I1+I2\displaystyle\partial_{\alpha,\alpha}I(\alpha,\beta)=I_{1}+I_{2} (306)

    where

    I1:\displaystyle I_{1}: =λα,α[ln(1+λαβ)ln(1λαβ¯)\displaystyle=\lambda_{\alpha,\alpha}\left[\ln(1+\frac{\lambda}{\alpha\beta})-\ln(1-\frac{\lambda}{\alpha\bar{\beta}})\right.
    ln(1λα¯β)+ln(1+λα¯β¯)]\displaystyle\quad\quad\left.-\ln(1-\frac{\lambda}{\bar{\alpha}\beta})+\ln(1+\frac{\lambda}{\bar{\alpha}\bar{\beta}})\right] (307)

    and

    I2\displaystyle I_{2} :=λα2(1αβ1+λαβ+1αβ¯1λαβ¯+1α¯β1λα¯β+1α¯β¯1+λα¯β¯)\displaystyle:=\lambda_{\alpha}^{2}\left(\frac{\frac{1}{\alpha\beta}}{1+\frac{\lambda}{\alpha\beta}}+\frac{\frac{1}{\alpha\bar{\beta}}}{1-\frac{\lambda}{\alpha\bar{\beta}}}+\frac{\frac{1}{\bar{\alpha}\beta}}{1-\frac{\lambda}{\bar{\alpha}\beta}}+\frac{\frac{1}{\bar{\alpha}\bar{\beta}}}{1+\frac{\lambda}{\bar{\alpha}\bar{\beta}}}\right)
    +λα(1α1+λαβ1α1λαβ¯+1α¯1λα¯β1α¯1+λα¯β¯)\displaystyle+\lambda_{\alpha}\left(\frac{\frac{1}{\alpha}}{1+\frac{\lambda}{\alpha\beta}}-\frac{\frac{1}{\alpha}}{1-\frac{\lambda}{\alpha\bar{\beta}}}+\frac{\frac{1}{\bar{\alpha}}}{1-\frac{\lambda}{\bar{\alpha}\beta}}-\frac{\frac{1}{\bar{\alpha}}}{1+\frac{\lambda}{\bar{\alpha}\bar{\beta}}}\right)
    +λαλ(1α2β1+λαβ+1α2β¯1λαβ¯+1α¯2β1λα¯β+1α¯2β¯1+λα¯β¯)\displaystyle+\lambda_{\alpha}\lambda\left(\frac{-\frac{1}{\alpha^{2}\beta}}{1+\frac{\lambda}{\alpha\beta}}+\frac{-\frac{1}{\alpha^{2}\bar{\beta}}}{1-\frac{\lambda}{\alpha\bar{\beta}}}+\frac{\frac{1}{\bar{\alpha}^{2}\beta}}{1-\frac{\lambda}{\bar{\alpha}\beta}}+\frac{\frac{1}{\bar{\alpha}^{2}\bar{\beta}}}{1+\frac{\lambda}{\bar{\alpha}\bar{\beta}}}\right)
    +λ(1α21+λαβ+1α21λαβ¯+1α¯21λα¯β+1α¯21+λα¯β¯)\displaystyle+\lambda\left(\frac{-\frac{1}{\alpha^{2}}}{1+\frac{\lambda}{\alpha\beta}}+\frac{\frac{1}{\alpha^{2}}}{1-\frac{\lambda}{\alpha\bar{\beta}}}+\frac{\frac{1}{\bar{\alpha}^{2}}}{1-\frac{\lambda}{\bar{\alpha}\beta}}+\frac{-\frac{1}{\bar{\alpha}^{2}}}{1+\frac{\lambda}{\bar{\alpha}\bar{\beta}}}\right) (308)

    We can Taylor expand I1I_{1} using the facts that λα,α=O(λp2)\lambda_{\alpha,\alpha}=O(\frac{\lambda}{p^{2}}) and α,β=Θ(p)\alpha,\beta=\Theta(p) to obtain

    I1=O(λ2p4).\displaystyle I_{1}=O(\frac{\lambda^{2}}{p^{4}}). (309)

    We can Taylor expand I2I_{2} using the fact that λα=O(λp)\lambda_{\alpha}=O(\frac{\lambda}{p}) (analogous to (296)) to obtain

    I2=O(λ2p4).\displaystyle I_{2}=O(\frac{\lambda^{2}}{p^{4}}). (310)

    Thus

    |α,αI(α,β)|=O(λ2p4)=O(δ2).\displaystyle|\partial_{\alpha,\alpha}I(\alpha,\beta)|=O(\frac{\lambda^{2}}{p^{4}})=O(\delta^{2}). (311)

    By symmetry, the same bound holds for |β,βI(α,β)||\partial_{\beta,\beta}I(\alpha,\beta)| as well. Together with (300), this validates (284), and consequently (281), in this case.

  • Case 2: max{α,β}10p\max\{\alpha,\beta\}\geq 10p.
    Without loss of generality assume that αβ\alpha\geq\beta and α10p\alpha\geq 10p. From (292), we have

    αI(p,p)\displaystyle\quad\partial_{\alpha}I(p,p)
    =(p+λα(p,p))ln(1+δ)+(p¯λα(p,p))ln(1δp/p¯)\displaystyle=(p+\lambda_{\alpha}(p,p))\ln(1+\delta)+(\bar{p}-\lambda_{\alpha}(p,p))\ln(1-\delta p/\bar{p})
    +(pλα(p,p))ln(1δp/p¯)\displaystyle\quad+(-p-\lambda_{\alpha}(p,p))\ln(1-\delta p/\bar{p})
    +(p¯+λα(p,p))ln(1+δp2/p¯2).\displaystyle\quad+(-\bar{p}+\lambda_{\alpha}(p,p))\ln(1+\delta p^{2}/\bar{p}^{2}). (312)

    Using (304) with λδp2\lambda\leftarrow\delta p^{2} and αp\alpha\leftarrow p, we obtain λα(p,p)=O(pδ)\lambda_{\alpha}(p,p)=O(p\delta). Thus, Taylor expanding the right-hand side of (312) (the leading-order cancellation is verified after Case 3), we obtain

    αI(p,p)pδ2.\displaystyle\partial_{\alpha}I(p,p)\lesssim p\delta^{2}. (313)

    Then

    ϕ(α,β)\displaystyle\phi(\alpha,\beta) :=I(p,p)I(α,β)+Iα(p,p)(αp)\displaystyle:=I(p,p)-I(\alpha,\beta)+I_{\alpha}(p,p)(\alpha-p)
    +Iβ(p,p)(βp)\displaystyle\quad+I_{\beta}(p,p)(\beta-p) (314)
    p2δ20+2Iα(p,p)(αp)\displaystyle\leq p^{2}\delta^{2}-0+2I_{\alpha}(p,p)(\alpha-p) (315)
    pαδ2\displaystyle\lesssim p\alpha\delta^{2} (316)

    where we used the assumption that αβ\alpha\geq\beta and the fact that Iα(p,p)=Iβ(p,p)I_{\alpha}(p,p)=I_{\beta}(p,p). Now the assumption that α10p\alpha\geq 10p implies

    ψ(α,β)d(αp)α.\displaystyle\psi(\alpha,\beta)\geq d(\alpha\|p)\gtrsim\alpha. (317)

    To see the second inequality in (317), note that

    minα[10p,1]1αd(αp)\displaystyle\min_{\alpha\in[10p,1]}\frac{1}{\alpha}d(\alpha\|p) =minα[10p,1]{lnαp+1ααln1α1p}\displaystyle=\min_{\alpha\in[10p,1]}\left\{\ln\frac{\alpha}{p}+\frac{1-\alpha}{\alpha}\ln\frac{1-\alpha}{1-p}\right\} (318)
    =d(10pp)10p\displaystyle=\frac{d(10p\|p)}{10p} (319)

    where the minimization is easily solved by checking that the derivative is positive for α10p\alpha\geq 10p (a one-line verification is given after Case 3). Finally, combining (317) with (316), we obtain ϕ(α,β)ψ(α,β)δ2p\frac{\phi(\alpha,\beta)}{\psi(\alpha,\beta)}\lesssim\delta^{2}p, as desired.

  • Case 3: min{α,β}0.1p\min\{\alpha,\beta\}\leq 0.1p, max{α,β}<10p\max\{\alpha,\beta\}<10p.
    Assume without loss of generality that α0.1p\alpha\leq 0.1p. In this case, using (313), we have

    ϕ(α,β)\displaystyle\phi(\alpha,\beta) :=I(p,p)I(α,β)+Iα(p,p)(αp)\displaystyle:=I(p,p)-I(\alpha,\beta)+I_{\alpha}(p,p)(\alpha-p)
    +Iβ(p,p)(βp)\displaystyle\quad+I_{\beta}(p,p)(\beta-p) (320)
    p2δ20+Iα(p,p)(α+β2p)\displaystyle\leq p^{2}\delta^{2}-0+I_{\alpha}(p,p)(\alpha+\beta-2p) (321)
    p2δ2+Iα(p,p)18p\displaystyle\leq p^{2}\delta^{2}+I_{\alpha}(p,p)\cdot 18p (322)
    =O(p2δ2).\displaystyle=O(p^{2}\delta^{2}). (323)

    On the other hand,

    ψ(α,β)\displaystyle\psi(\alpha,\beta) d(αp)\displaystyle\geq d(\alpha\|p) (324)
    d(0.1pp)\displaystyle\geq d(0.1p\|p) (325)
    =(0.90.1ln10)p+O(p2)\displaystyle=(0.9-0.1\ln 10)p+O(p^{2}) (326)
    =Θ(p).\displaystyle=\Theta(p). (327)

    Thus we once again obtain ϕ(α,β)ψ(α,β)δ2p\frac{\phi(\alpha,\beta)}{\psi(\alpha,\beta)}\lesssim\delta^{2}p, as desired.
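For the reader's convenience, we conclude this case analysis by recording routine verifications of three claims made above, namely (305), (313), and (319); these are straightforward consequences of the displayed equations and introduce no new ingredients. First, solving (303) for \lambda_{\alpha,\alpha}, substituting (304), and recalling that \lambda/(\alpha\beta)=O(\delta)=O(1) (as in the derivation of (300)), we obtain, loosely,

\displaystyle\lambda_{\alpha,\alpha}=\Theta(p^{2})\cdot\left[\lambda_{\alpha}\cdot\Theta\left(\frac{1}{p^{3}}\right)-\lambda^{2}\cdot\Theta\left(\frac{1}{p^{6}}\right)-\lambda\cdot\Theta\left(\frac{1}{p^{4}}\right)\right]=O\left(\frac{\lambda}{p^{2}}\right)+O\left(\frac{\lambda^{2}}{p^{4}}\right)=O\left(\frac{\lambda}{p^{2}}\right),

which is (305).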
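Second, the cancellation behind (313) can be made explicit. Expanding the four logarithms in (312) to first order in \delta, the terms not involving \lambda_{\alpha}(p,p) contribute

\displaystyle p\cdot\delta+\bar{p}\cdot\left(-\frac{\delta p}{\bar{p}}\right)+(-p)\cdot\left(-\frac{\delta p}{\bar{p}}\right)+(-\bar{p})\cdot\frac{\delta p^{2}}{\bar{p}^{2}}=0,

while the terms involving \lambda_{\alpha}(p,p)=O(p\delta) and the second-order remainders each contribute O(p\delta^{2}), which yields (313).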
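Third, the monotonicity used in (318)–(319) follows from a one-line computation: since \partial_{\alpha}d(\alpha\|p)=\ln\frac{\alpha}{p}-\ln\frac{1-\alpha}{1-p}, we have

\displaystyle\frac{d}{d\alpha}\left[\frac{1}{\alpha}d(\alpha\|p)\right]=\frac{\alpha\,\partial_{\alpha}d(\alpha\|p)-d(\alpha\|p)}{\alpha^{2}}=\frac{1}{\alpha^{2}}\ln\frac{1-p}{1-\alpha}>0\quad\text{for }\alpha>p,

so the minimum in (318) is attained at \alpha=10p, as claimed in (319).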

References

  • [1] Jayadev Acharya, Clement Canonne, Aditya Vikram Singh, and Himanshu Tyagi. Optimal rates for nonparametric density estimation under communication constraints. Advances in Neural Information Processing Systems, 34, 2021.
  • [2] Jayadev Acharya, Clément L Canonne, and Himanshu Tyagi. Inference under information constraints II: Communication constraints and shared randomness. IEEE Transactions on Information Theory, 66(12):7856–7877, 2020.
  • [3] Rudolf Ahlswede and Marat V Burnashev. On minimax estimation in the presence of side information about remote data. Annals of Statistics, pages 141–171, 1990.
  • [4] Rudolf Ahlswede and Imre Csiszár. Hypothesis testing with communication constraints. IEEE Transactions on Information Theory, 32(4), 1986.
  • [5] Rudolf Ahlswede and Peter Gács. Spreading of sets in product spaces and hypercontraction of the Markov operator. The Annals of Probability, 4:925–939, 1976.
  • [6] Venkat Anantharam, Amin Gohari, Sudeep Kamath, and Chandra Nair. On maximal correlation, hypercontractivity, and the data processing inequality studied by Erkip and Cover. arXiv:1304.6133, 2013.
  • [7] Leighton Pate Barnes, Yanjun Han, and Ayfer Özgür. Lower bounds for learning distributions under communication constraints via Fisher information. The Journal of Machine Learning Research, 21(1):9583–9612, 2020.
  • [8] Heather Battey, Jianqing Fan, Han Liu, Junwei Lu, and Ziwei Zhu. Distributed testing and estimation under sparse high dimensional models. Annals of Statistics, 46(3):1352, 2018.
  • [9] Mark Braverman and Anup Rao. Information equals amortized communication. In FOCS, pages 748–757, 2011.
  • [10] Mark Braverman, Ankit Garg, Tengyu Ma, Huy L Nguyen, and David P Woodruff. Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 1011–1020, 2016.
  • [11] T. Tony Cai and Hongji Wei. Distributed Gaussian mean estimation under communication constraints: Optimal rates and communication-efficient algorithms. arXiv:2001.08877, 2020.
  • [12] Amit Chakrabarti and Oded Regev. An optimal lower bound on the communication complexity of gap-Hamming distance. SIAM Journal on Computing, 41(5):1299–1317, 2012.
  • [13] Amit Chakrabarti, Yaoyun Shi, Anthony Wirth, and Andrew Yao. Informational complexity and the direct sum problem for simultaneous message complexity. In Proceedings 42nd IEEE Symposium on Foundations of Computer Science, pages 270–278. IEEE, 2001.
  • [14] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. 2nd Edition, Wiley, July 2006.
  • [15] Imre Csiszár. I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, 3(1):146–158, 1975.
  • [16] John C Duchi, Michael I Jordan, and Martin J Wainwright. Local privacy and statistical minimax rates. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 429–438, 2013.
  • [17] Peter Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2):194–203, 1975.
  • [18] Peter Kairouz et al. Advances and Open Problems in Federated Learning. Foundations and Trends in Machine Learning Series. 2021.
  • [19] Uri Hadar, Jingbo Liu, Yury Polyanskiy, and Ofer Shayevitz. Communication complexity of estimating correlations. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 792–803, 2019.
  • [20] Uri Hadar, Jingbo Liu, Yury Polyanskiy, and Ofer Shayevitz. Error exponents in distributed hypothesis testing of correlations. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 2674–2678, 2019.
  • [21] Yanjun Han, Pritam Mukherjee, Ayfer Ozgur, and Tsachy Weissman. Distributed statistical estimation of high-dimensional and nonparametric distributions. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 506–510. IEEE, 2018.
  • [22] Yanjun Han, Ayfer Özgür, and Tsachy Weissman. Geometric lower bounds for distributed parameter estimation under communication constraints. In Conference On Learning Theory, pages 3163–3188. PMLR, 2018.
  • [23] Charles Hobby and Ronald Pyke. Doubly stochastic operators obtained from positive operators. Pacific Journal of Mathematics, 15(1):153–157, 1965.
  • [24] Piotr Indyk and David Woodruff. Tight lower bounds for the distinct elements problem. In Proceedings of 44th Annual IEEE Symposium on Foundations of Computer Science, pages 283–288, 2003.
  • [25] Amiram Kaspi. Two-way source coding with a fidelity criterion. IEEE Transactions on Information Theory, 31(6):735–740, 1985.
  • [26] Eyal Kushilevitz and Noam Nisan. Communication Complexity. Cambridge University Press, 1997.
  • [27] Jingbo Liu. Information Theory from A Functional Viewpoint. Ph.D. thesis, Princeton University, Princeton, NJ, 2018.
  • [28] Jingbo Liu. Communication complexity of two-party nonparametric global density estimation. In 56th Annual Conference on Information Sciences and Systems (CISS), 2022.
  • [29] Jingbo Liu. Interaction improves two-party nonparametric pointwise density estimation. In 2022 IEEE International Symposium on Information Theory (ISIT), pages 904–909, 2022.
  • [30] Jingbo Liu, Thomas A Courtade, Paul Cuff, and Sergio Verdú. Smoothing Brascamp-Lieb inequalities and strong converses for common randomness generation. In 2016 IEEE International Symposium on Information Theory (ISIT), pages 1043–1047. IEEE, 2016.
  • [31] Jingbo Liu, Paul Cuff, and Sergio Verdú. Key generation with limited interaction. In 2016 IEEE International Symposium on Information Theory (ISIT), pages 2918–2922. IEEE, 2016.
  • [32] Jingbo Liu, Paul Cuff, and Sergio Verdú. Secret key generation with limited interaction. IEEE Transactions on Information Theory, 63(11):7358–7381, 2017.
  • [33] Nan Ma and Prakash Ishwar. Some results on distributed source coding for interactive function computation. IEEE Transactions on Information Theory, 57(9):6180–6195, 2011.
  • [34] Ilan Newman. Private vs. common random bits in communication complexity. Information Processing Letters, 39(2):67–71, 1991.
  • [35] Noam Nisan and Avi Wigderson. Rounds in communication complexity revisited. In Proceedings of the twenty-third annual ACM symposium on Theory of computing, pages 419–429, 1991.
  • [36] Alon Orlitsky. Worst-case interactive communication I: Two messages are almost optimal. IEEE Transactions on Information Theory, 36(5):1111–1126, 1990.
  • [37] Jeff M Phillips and Wai Ming Tai. Near-optimal coresets of kernel density estimates. Discrete & Computational Geometry, 63(4):867–887, 2020.
  • [38] K. R. Sahasranand and Himanshu Tyagi. Extra samples can reduce the communication for independence testing. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 2316–2320, 2018.
  • [39] Igal Sason and Sergio Verdú. f-divergence inequalities. IEEE Transactions on Information Theory, 62(11):5973–6006, 2016.
  • [40] Richard Sinkhorn. Diagonal equivalence to matrices with prescribed row and column sums. The American Mathematical Monthly, 74(4):402–405, 1967.
  • [41] Charles J Stone. Optimal rates of convergence for nonparametric estimators. Annals of Statistics, pages 1348–1360, 1980.
  • [42] Madhu Sudan, Badih Ghazi, Noah Golowich, and Mitali Bafna. Communication-rounds tradeoffs for common randomness and secret key generation. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1861–1871. SIAM, 2019.
  • [43] Madhu Sudan, Himanshu Tyagi, and Shun Watanabe. Communication for generating correlation: A unifying survey. IEEE Transactions on Information Theory, 66(1):5–37, 2019.
  • [44] Alexandre B Tsybakov. Introduction to nonparametric estimation. Springer Science & Business Media, 2008.
  • [45] Paxton Turner, Jingbo Liu, and Philippe Rigollet. Efficient interpolation of density estimators. In International Conference on Artificial Intelligence and Statistics, pages 2503–2511. PMLR, 2021.
  • [46] Paxton Turner, Jingbo Liu, and Philippe Rigollet. A statistical perspective on coreset density estimation. In International Conference on Artificial Intelligence and Statistics, pages 2512–2520. PMLR, 2021.
  • [47] Himanshu Tyagi. Common information and secret key capacity. IEEE Transactions on Information Theory, 59(9):5627–5640, 2013.
  • [48] Yu Xiang and Young-Han Kim. Interactive hypothesis testing against independence. In 2013 IEEE International Symposium on Information Theory, pages 2840–2844, 2013.
  • [49] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):1–19, 2019.
  • [50] Andrew Chi-Chih Yao. Some complexity questions related to distributive computing (preliminary report). In Proceedings of the eleventh annual ACM symposium on Theory of computing, pages 209–213, 1979.
  • [51] Bin Yu. Assouad, Fano, and Le Cam. Festschrift for Lucien Le Cam, pages 423–435, 1996.
  • [52] Yuchen Zhang, John C Duchi, Michael I Jordan, and Martin J Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In NIPS, pages 2328–2336, 2013.
  • [53] Yuancheng Zhu and John Lafferty. Distributed nonparametric regression under communication constraints. In International Conference on Machine Learning, pages 6009–6017, 2018.
  • [54] Yuancheng Zhu and John Lafferty. Quantized minimax estimation over Sobolev ellipsoids. Information and Inference: A Journal of the IMA, 7(1):31–82, 2018.
Jingbo Liu received the B.S. degree in Electrical Engineering from Tsinghua University, Beijing, China, in 2012, and the M.A. and Ph.D. degrees in Electrical Engineering from Princeton University, Princeton, NJ, USA, in 2014 and 2017, respectively. He was a Norbert Wiener Postdoctoral Research Fellow at the MIT Institute for Data, Systems, and Society (IDSS) during 2018-2020. Since 2020, he has been an assistant professor in the Department of Statistics and an affiliate in the Department of Electrical and Computer Engineering at the University of Illinois, Urbana-Champaign, IL, USA. His research interests include information theory, high-dimensional statistics and probability, coding theory, and related fields. His undergraduate thesis received the best undergraduate thesis award at Tsinghua University (2012). He gave a semi-plenary presentation at the 2015 IEEE International Symposium on Information Theory, Hong Kong, China. He was a recipient of the Princeton University Wallace Memorial Honorific Fellowship in 2016. His Ph.D. thesis received the Bede Liu Best Dissertation Award of Princeton University and the Thomas M. Cover Dissertation Award of the IEEE Information Theory Society (2018).