
A Few Interactions Improve Distributed Nonparametric Estimation, Optimally

Jingbo Liu. Jingbo Liu is with the Department of Statistics and the Department of Electrical and Computer Engineering, University of Illinois, Urbana-Champaign, IL, 61820, US. Email: [email protected]. Received 16-Feb-2022; revised 27-Nov-2022. Associate editor: Himanshu Tyagi. This paper was presented in part at the 2022 IEEE International Symposium on Information Theory (ISIT) [29].
Abstract

Consider the problem of nonparametric estimation of an unknown $\beta$-Hölder smooth density $p_{XY}$ at a given point, where $X$ and $Y$ are both $d$ dimensional. An infinite sequence of i.i.d. samples $(X_{i},Y_{i})$ is generated according to this distribution, and two terminals observe $(X_{i})$ and $(Y_{i})$, respectively. They are allowed to exchange $k$ bits, either one-way or interactively, in order for Bob to estimate the unknown density. We show that the minimax mean square risk is of order $\left(\frac{k}{\log k}\right)^{-\frac{2\beta}{d+2\beta}}$ for one-way protocols and $k^{-\frac{2\beta}{d+2\beta}}$ for interactive protocols. The logarithmic improvement is absent in the parametric counterparts, and can therefore be regarded as a consequence of the nonparametric nature of the problem. Moreover, a few rounds of interaction suffice to achieve the interactive minimax rate: the number of rounds can grow as slowly as the super-logarithm (i.e., inverse tetration) of $k$. The proof of the upper bound is based on a novel multi-round scheme for estimating the joint distribution of a pair of biased Bernoulli variables, and the lower bound is built on a sharp estimate of a symmetric strong data processing constant for biased Bernoulli variables.

Index Terms:
Density estimation, Communication complexity, Nonparametric statistics, Learning with system constraints, Strong data processing constant

I Introduction

The communication complexity problem was introduced in the seminal paper of Yao [50] (see also [26] for a survey), where two terminals (which we call Alice and Bob) compute a given Boolean function of their local inputs (${\bf X}=(X_{i})_{i=1}^{n}$ and ${\bf Y}=(Y_{i})_{i=1}^{n}$) by means of exchanging messages. The famous log-rank conjecture provides an estimate of the communication complexity of a general Boolean function, which is still open to date. Meanwhile, communication complexity of certain specific functions can be better understood. For example, the Gap-Hamming problem [24][12] concerns testing $f({\bf X},{\bf Y})>\frac{1}{n}$ against $f({\bf X},{\bf Y})<-\frac{1}{n}$, where $f({\bf X},{\bf Y}):=\frac{1}{n}\sum_{i=1}^{n}X_{i}Y_{i}$ denotes the sample correlation and $X_{i},Y_{i}\in\{+1,-1\}$. It was shown in [12] with a geometric argument that the communication complexity (for worst-case deterministic ${\bf X},{\bf Y}$) is $\Theta(n)$; therefore a one-way protocol where Alice simply sends ${\bf X}$ cannot be improved (up to a multiplicative constant) by an interactive protocol.

Gap-Hamming is closely related to the problem of estimating the joint distribution of a pair of binary or Gaussian random variables (using $n$ i.i.d. samples). Indeed, for $n$ large we may assume that Alice (resp. Bob) can estimate the marginal distributions of $X$ (resp. $Y$) very well, so that the joint distribution is parameterized by only one scalar, which is the correlation. An information-theoretic proof of Gap-Hamming was previously provided in [19], building on a converse for correlation estimation for the binary symmetric distribution, and pinned down the exact prefactor in the risk-communication tradeoff. In particular, the result of [19] implies that the naive algorithm where Alice simply sends ${\bf X}$ can be improved by a constant factor in the estimation risk by a more sophisticated scheme using additional samples. For the closely related problem of correlation (distribution) testing, [38] and [48] provided asymptotically tight bounds on the communication complexity under the one-way and interactive protocols when the null hypothesis is the independent distribution (zero correlation), which also implies that the error exponent can be improved by an algorithm using additional samples. The technique of [48] is based on the tensorization of internal and external information ((20) ahead), whereas the bound of [38] uses hypercontractivity. More recently, [20] derived bounds for testing against dependent distributions using optimal transport inequalities.

In this paper, we take the natural step of introducing nonparametric (NP) statistics to Alice and Bob, whereby two parties estimate a nonparametric density by means of sending messages interactively. It will be seen that this problem is closely related to a “sparse” version of the aforementioned Gap-Hamming problem, where interaction does help, in contrast to the usual Gap-Hamming problem.

For concreteness, consider the problem of nonparametric estimation of an unknown $\beta$-Hölder smooth density $p_{XY}$ at a given point $(x_{0},y_{0})$. For simplicity we assume the symmetric case where $X$ and $Y$ are both $d$ dimensional. An infinite sequence of i.i.d. samples $(X_{i},Y_{i})$ is generated according to $p_{XY}$, and Alice and Bob observe $(X_{i})$ and $(Y_{i})$, respectively. After they exchange $k$ bits (either one-way or interactively), Bob estimates the unknown density at the given point. We successfully characterize the minimax rate in terms of the communication complexity $k$: it is of order $\left(\frac{k}{\log k}\right)^{-\frac{2\beta}{d+2\beta}}$ for one-way protocols and $k^{-\frac{2\beta}{d+2\beta}}$ for interactive protocols.

Notably, allowing interaction strictly improves the estimation risk. Previously, separations between one-way and interactive protocols were known, but in very different contexts. In [32, Corollary 1] (see also [31]), a separation was found in the rate region of common randomness generation from biased binary distributions, using certain convexity arguments, but this only implies a difference in the leading constant, rather than in the asymptotic scaling. On the other hand, the example distribution in [42] is based on the pointer-chasing construction of [35], which appears to be a highly artificial distribution designed to entail a separation between the one-way and interactive protocols. Another example where interaction improves zero-error source coding with side information, based on a “bit location” algorithm, was described in [36], where it was shown that two-way communication complexity differs from interactive communication complexity only by constant factors. In contrast, the logarithmic separation in the present paper arises from the nonparametric nature of the problem: if we consider the problem of correlation estimation for Bernoulli pairs with a fixed bias (a parametric problem), the risk will be of order $k^{-\frac{1}{2}}$, and there will be no separation between one-way and interactive protocols (which is indeed the case in [19]). In contrast, nonparametric estimation is analogous to Bernoulli correlation estimation where the bias changes with $k$ (since the optimal bandwidth adapts to $k$), which gives rise to the separation.

For the risk upper bound, in the one-way setting it is efficient for Alice to just encode the set of $i$'s such that $X_{i}$ falls within a neighborhood (whose size is the optimal bandwidth for the given $k$) of the given point $x_{0}$. To achieve the optimal $k^{-\frac{2\beta}{d+2\beta}}$ rate for interactive protocols, we provide a novel scheme that uses $r>1$ rounds of interaction, where $r=r(k)$ grows as slowly as the super-logarithm (i.e., the inverse of tetration) of $k$. With the sequence $r(k)$ we use in Section V-C (and supposing that $\beta=d=1$), while $r=4$ rounds of interaction are barely enough for $k$ equal to the number of letters in a short sentence, $r=8$ is more than sufficient for $k$ equal to the number of all elementary particles in the entire observable universe. Thus from a practical perspective, $r(k)$ is effectively a constant, although it remains an interesting theoretical question whether $r(k)$ really diverges (Conjecture 1).

For the lower bound, the proof is based on the symmetric data processing constant introduced in [32]. Previously, the data processing constant $s^{*}_{r}$ has been connected to two-party estimation and hypothesis testing in [19]; the idea was canonized as the following statement: “Information for hypothesis testing locally” is upper bounded by $s^{*}_{r}$ times “Information communicated mutually”. However, $s^{*}_{r}$ is not easy to compute, and previous bounds on $s^{*}_{r}$ are also not tight enough for our purpose. Instead, we first use an idea of simulation of continuous variables to reduce the problem to estimation of Bernoulli distributions, for which $s^{*}_{r}$ is easier to analyze. Then we use some new arguments to bound $s^{*}_{\infty}$.

Let us emphasize that this paper concerns density estimation at a given point, rather than estimation of the global density function. For the latter problem, it is optimal for Alice to just quantize the samples and send them to Bob, which we show in the companion paper [28]. The mean square error (in $\ell_{2}$ norm) of estimating the global density function scales differently from the case of pointwise density estimation, since the messages cannot be tailored to the given point.

Related work. Besides function computation, distribution estimation and testing, other problems which have been studied in the communication complexity or privacy settings include lossy source coding [25] and common randomness or secret key generation [47][30][32][43]. The key technical tool for interactive two-way communication models, namely the tensorization of internal and external information ((20) ahead), appeared in [25] for lossy compression, [9][33] for function computation, [47][32] for common randomness generation, and [48][19] for parameter estimation.

For one-way communication models, the main tool is a tensorization property related to the strong data processing constant (see (10) ahead), which was first used in [4] in the study of the error exponents in communication constrained hypothesis testing. The hypercontractivity method for single-shot bounds in one-way models was used in [30][27] for common randomness generation and [38] for testing.

In statistics, communication-constrained estimation has received considerable attention recently, starting from [52], which considered a model where distributed samples are compressed and sent to a central estimator. Further works on this communication model include settings of Gaussian location estimation [10][11], parametric estimation [22], nonparametric regression [53], the Gaussian noise model [54], statistical inference [2], and nonparametric estimation [21] (with a bug fixed in [7]) [1]. Related problems solved using similar techniques include differential privacy [16] and data summarization [37][46][45]. Communication-efficient construction of test statistics for distributed testing using the divide-and-conquer algorithm is studied in [8]. Generally speaking, these works on statistical minimax rates concern the so-called horizontal partitioning of data sets, where data sets share the same feature space but differ in samples [49][18]. In contrast, vertical distributed or federated learning, where data sets differ in features, has been used by corporations such as those in finance and medical care [49][18]. It is worth mentioning that such a horizontal partitioning model was also introduced in Yao’s paper [50] in the context of function computation under the name “simultaneous message model”, where different parties send messages to a referee instead of to each other. The direct sum property (similar to the tensorization property of internal and external information) of the simultaneous message model was discussed in [13].

Organization of the paper. We review the background on nonparametric estimation, data processing constants and testing independence in Section II. The formulation of the two-party nonparametric estimation problem and the summary of main results are given in Section III. Section IV examines the problem of estimating a parameter in a pair of biased Bernoulli distributions, which will be used as a building block in our nonparametric estimation algorithm. Section V proves some bounds on information exchanges, which will be the key auxiliary results for the proof of the upper bound for Bernoulli estimation in Section VI, and for nonparametric estimation in Section VII. Finally, lower bounds are proved in Section VIII in the one-way case and in Section IX in the interactive case.

II Preliminaries

II-A Notation

We use capital letters for probability measures and lower-case letters for the density functions. We use the abbreviations $U_{i}^{j}:=(U_{i},\dots,U_{j})$ and $U^{j}:=U_{1}^{j}$. We use boldface letters to denote vectors, for example ${\bf U}_{i}=(U_{i}(l))_{l=1}^{n}$. Unless otherwise specified, the base of logarithms can be arbitrary but remains consistent throughout equations. The precise meaning of the Landau notations, such as $O(\cdot)$, will be explained in each section or in the proofs of specific theorems. We use $\sum_{1\leq i\leq r}^{\rm odd}$ to denote summing over $i\in\{1,\dots,r\}\setminus 2\mathbb{Z}$. For the vector representation of a binary probability distribution, we use the convention that $P_{U}=[P_{U}(0),P_{U}(1)]$. For the matrix representation of a joint distribution of a pair of binary random variables, we use the convention that $P_{XY}=\begin{bmatrix}P_{XY}(0,0)&P_{XY}(0,1)\\ P_{XY}(1,0)&P_{XY}(1,1)\end{bmatrix}$. For $x\in[0,1]$, we use the shorthand $\bar{x}:=1-x$.

II-B Nonparametric Estimation

Let us recall the basics about the problem of estimating a smooth density; more details may be found in [44, 41]. Let $d\geq 1$ be an integer, and let $s=(s_{1},\dots,s_{d})\in\{0,1,2,\dots\}^{d}$ be a multi-index. For $x=(x_{1},\dots,x_{d})\in\mathbb{R}^{d}$, let $D^{s}$ denote the differential operator

$D^{s}=\frac{\partial^{s_{1}+\dots+s_{d}}}{\partial x_{1}^{s_{1}}\cdots\partial x_{d}^{s_{d}}}.$ (1)

Given $\beta\in(0,\infty)$, let $\lfloor\beta\rfloor$ be the maximum integer strictly smaller than $\beta$ [44] (note the difference with the usual conventions). Given a function $f$ whose domain includes a set $\mathcal{A}\subseteq\mathbb{R}^{d}$, define $\|f\|_{\mathcal{A},\beta}$ as the minimum $L\geq 0$ such that

$|D^{s}f(x_{1})-D^{s}f(x_{2})|\leq L\|x_{1}-x_{2}\|_{2}^{\beta-\lfloor\beta\rfloor},\quad\forall x_{1},x_{2}\in\mathcal{A},$ (2)

for all multi-indices $s$ such that $s_{1}+\dots+s_{d}=\lfloor\beta\rfloor$. For example, $\beta=1$ defines a Lipschitz function, and an integer $\beta$ defines a function with bounded $\beta$-th derivative.

Given $L>0$, let $\mathcal{P}(\beta,L)$ be the class of probability density functions $p$ satisfying $\|p\|_{\mathbb{R}^{d},\beta}\leq L$. Let $x_{0}\in\mathbb{R}^{d}$ be arbitrary. The following result on the minimax estimation error is well-known:

$\inf_{T_{n}}\sup_{p\in\mathcal{P}(\beta,L)}\mathbb{E}[|T_{n}-p(x_{0})|^{2}]=\Theta(n^{-\frac{2\beta}{d+2\beta}})$ (3)

where the infimum is over all estimators $T_{n}$ of $p(x_{0})$, i.e., measurable maps from i.i.d. samples $X_{1},\dots,X_{n}\sim p$ to $\mathbb{R}$. The $\Theta(\cdot)$ in (3) may hide constants independent of $n$.

We say $K\colon\mathbb{R}^{d}\to\mathbb{R}$ is a kernel of order $l$ ($l\in\{1,2,\dots\}$) if $\int K=1$ and all derivatives of the Fourier transform of $K$ up to the $l$-th vanish at 0 [44, Definition 1.3]. Therefore the rectangular kernel, which is the indicator of a set, is of order 1. A kernel estimator has the form

$T_{n}=\frac{1}{nh^{d}}\sum_{l=1}^{n}K\left(\frac{X_{l}-x_{0}}{h}\right)$ (4)

where $h\in(0,\infty)$ is called the bandwidth. If $K$ is a kernel of order $l=\lfloor\beta\rfloor$, then the kernel estimator (4) with an appropriate $h$ achieves the bound in (3) [44, Chapter 1]. In particular, the rectangular kernel is minimax optimal for $\beta\in(0,2]$.
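To make (4) concrete, the following is a minimal sketch (ours, not from the paper) of the rectangular-kernel estimator at a point, with the rate-optimal bandwidth $h\sim n^{-\frac{1}{d+2\beta}}$; the Gaussian example density and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def rectangular_kernel_estimate(samples, x0, h):
    """Kernel estimator (4) with the rectangular kernel K(u) = 2^{-d} * 1{ ||u||_inf <= 1 }."""
    d = samples.shape[1]
    inside = np.all(np.abs(samples - x0) <= h, axis=1)
    return inside.sum() / (len(samples) * (2.0 * h) ** d)

# Illustrative example: d = 1, X ~ N(0, 1), estimate the density at x0 = 0 with the
# rate-optimal bandwidth h ~ n^{-1/(d + 2*beta)} (here beta = 2 is assumed).
n, beta, d = 100_000, 2.0, 1
x0 = np.zeros(d)
h = n ** (-1.0 / (d + 2 * beta))
samples = rng.standard_normal((n, d))
print(rectangular_kernel_estimate(samples, x0, h))  # should be close to 1/sqrt(2*pi) ~ 0.399
```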

If $K$ is compactly supported, then only local smoothness is needed, and a density lower bound does not change the rate: we have

$\inf_{T_{n}}\sup_{p\in\mathcal{P}_{\mathcal{S}}(\beta,L,A)}\mathbb{E}[|T_{n}-p(x_{0})|^{2}]=\Theta(n^{-\frac{2\beta}{d+2\beta}})$ (5)

where $\mathcal{S}$ is any compact neighborhood of $x_{0}$, $A\in[0,\frac{1}{\operatorname{vol}(\mathcal{S})})$ is arbitrary (with $\operatorname{vol}(\mathcal{S})$ denoting the volume of $\mathcal{S}$), and $\mathcal{P}_{\mathcal{S}}(\beta,L,A)$ denotes the non-empty set of probability density functions $p$ satisfying $\|p\|_{\mathcal{S},\beta}\leq L$ and $\inf_{x\in\mathcal{S}}p(x)\geq A$.

II-C Strong and Symmetric Data Processing Constants

The strong data processing constant has proved useful in many distributed estimation problems [10, 4, 16, 52]. In particular, it is strongly connected to two-party hypothesis testing under the one-way protocol. In contrast, the symmetric data processing constant [32] can be viewed as a natural extension to interactive protocols. This section recalls their definitions and auxiliary results, which will mainly be used in the proofs of lower bounds; however, the intuitions are useful for the upper bounds as well.

Given two probability measures PP, QQ on the same measurable space, define the KL divergence

$D(P\|Q):=\int\log\left(\frac{dP}{dQ}\right)dP.$ (6)

Define the $\chi^{2}$-divergence

$D_{\chi^{2}}(P\|Q):=\int\left(\frac{dP}{dQ}-1\right)^{2}dQ.$ (7)

Let $X,Y$ be two random variables with joint distribution $P_{XY}$. Define the mutual information

$I(X;Y):=D(P_{XY}\|P_{X}\times P_{Y}).$ (8)
Definition 1.

Let $P_{XY}$ be an arbitrary distribution on $\mathcal{X}\times\mathcal{Y}$. Define the strong data processing constant

$s^{*}(X;Y):=\sup_{P_{U|X}}\frac{I(U;Y)}{I(U;X)}$ (9)

where $P_{U|X}$ is a conditional distribution (with $\mathcal{U}$ being an arbitrary set), and the mutual informations are computed under the joint distribution $P_{U|X}P_{XY}$.

Clearly, the value of $s^{*}(X;Y)$ does not depend on the choice of the base of the logarithm. A basic yet useful property of the strong data processing constant is tensorization: if $({\bf X},{\bf Y})\sim P_{XY}^{\otimes n}$ then

$s^{*}({\bf X};{\bf Y})=s^{*}(X;Y).$ (10)

Now if $({\bf X},{\bf Y})$ are the samples observed by Alice and Bob, and $\Pi_{1}$ denotes the message sent to Bob, then $I(\Pi_{1};{\bf X})\leq k$ implies that

$D(P_{\Pi_{1}{\bf Y}}\|P_{\Pi_{1}}P_{\bf Y})\leq s^{*}(X;Y)k.$ (11)

The left side is the KL divergence between the distribution under the hypothesis that $(X,Y)$ follows some joint distribution, and the distribution under the hypothesis that $X$ and $Y$ are independent. Thus the error probabilities in testing against independence with one-way protocols can be lower bounded. This simple argument dates back at least to [4, 3].

A similar argument can be extended to testing independence under interactive protocols [48]. The fundamental fact enabling such extensions is the tensorization of certain information-theoretic quantities, which appeared in various contexts [25, 9, 32].

Definition 2.

Let $(X,Y)\sim P_{XY}$. For given $r<\infty$, define $s_{r}^{*}(X;Y)$ as the supremum of $R/S$ such that there exist random variables $U_{1},\dots,U_{r}$ satisfying

$R\leq\sum_{1\leq i\leq r}^{\rm odd}I(U_{i};Y|U^{i-1})+\sum_{1\leq i\leq r}^{\rm even}I(U_{i};X|U^{i-1});$ (12)
$S\geq\sum_{1\leq i\leq r}^{\rm odd}I(U_{i};X|U^{i-1})+\sum_{1\leq i\leq r}^{\rm even}I(U_{i};Y|U^{i-1}),$ (13)

and

$U_{i}-(X,U^{i-1})-Y,\quad i\in\{1,\dots,r\}\setminus 2\mathbb{Z}$ (14)
$U_{i}-(Y,U^{i-1})-X,\quad i\in\{1,\dots,r\}\cap 2\mathbb{Z}$ (15)

are Markov chains. We call $s_{\infty}^{*}(X;Y)$ the symmetric data processing constant.

Let us remark that, using the Markov chains, the right side of (12) satisfies

$\sum_{1\leq i\leq r}^{\rm odd}I(U_{i};Y|U^{i-1})+\sum_{1\leq i\leq r}^{\rm even}I(U_{i};X|U^{i-1})$
$=I(X;Y)-I(X;Y|U^{r})$ (16)
$=I(U^{r};XY)-[I(U^{r};X|Y)+I(U^{r};Y|X)]$ (17)

whereas the right side of (13) satisfies

$\sum_{1\leq i\leq r}^{\rm odd}I(U_{i};X|U^{i-1})+\sum_{1\leq i\leq r}^{\rm even}I(U_{i};Y|U^{i-1})=I(U^{r};XY).$ (18)

In the computer science literature [9], $I(U^{r};XY)$ is called the external information, whereas $I(U^{r};X|Y)+I(U^{r};Y|X)$ is called the internal information.

The symmetric strong data processing constant is symmetric in the sense that $s_{\infty}^{*}(X;Y)=s_{\infty}^{*}(Y;X)$, since $r=\infty$ in the definition. On the other hand, $s_{1}^{*}(X;Y)$ coincides with the strong data processing constant, which is generally not symmetric. Furthermore, a tensorization property holds for the internal and external information: denote by $\mathcal{R}(X;Y)$ the set of all $(R,S)$ satisfying (12) and (13) for some $U_{1},\dots,U_{r}$. Let $({\bf X},{\bf Y})\sim P_{XY}^{\otimes n}$. Then

$\mathcal{R}({\bf X};{\bf Y})=n\mathcal{R}(X;Y).$ (19)

In particular, taking the slope of the boundary at the origin yields

$s_{\infty}^{*}({\bf X};{\bf Y})=s_{\infty}^{*}(X;Y).$ (20)

A useful and general upper bound on $s_{\infty}^{*}$ in terms of an SVD was provided in [32, Theorem 4], which implies that $s_{\infty}^{*}=s_{1}^{*}$ when $X$ and $Y$ are unbiased Bernoulli. However, that bound is not tight enough for the nonparametric estimation problem we consider, and in fact we adopt a new approach in Section IX for the biased Bernoulli distribution. Let us remark that $s_{\infty}^{*}=s_{1}^{*}$ holds also for Gaussian $(X,Y)$, which follows by combining the result on the unbiased Bernoulli distribution and a central limit theorem argument [32] (see also [19]). Moreover, it was conjectured in [32] that the set of possible $(R,S)$ satisfying (12)-(13) does not depend on $r$ when $X$ and $Y$ are unbiased Bernoulli.

II-D Testing Against Independence

Consider the following setting: $P_{XY}$ is an arbitrary distribution on $\mathcal{X}\times\mathcal{Y}$; $P_{\bf XY}:=P_{XY}^{\otimes n}$; $\Pi=(\Pi_{0},\dots,\Pi_{r})$ is a sequence of random variables, with $P_{\Pi|{\bf XY}}$ being given and satisfying $P_{\Pi_{0}|{\bf XY}}=P_{\Pi_{0}}$, $P_{\Pi_{i}|{\bf XY}\Pi_{0}^{i-1}}=P_{\Pi_{i}|{\bf X}\Pi_{0}^{i-1}}$ for $i\in\{1,\dots,r\}\setminus 2\mathbb{Z}$, and $P_{\Pi_{i}|{\bf XY}\Pi_{0}^{i-1}}=P_{\Pi_{i}|{\bf Y}\Pi_{0}^{i-1}}$ for $i\in\{1,\dots,r\}\cap 2\mathbb{Z}$; $\bar{P}_{\bf XY}=P_{\bf X}P_{\bf Y}$ is the distribution under the hypothesis of independence, and $\bar{P}_{\Pi{\bf XY}}:=P_{\Pi|{\bf XY}}\bar{P}_{\bf XY}$. The following result is known in [48, 20, 19]:

Lemma 1.

$D(P_{{\bf Y}\Pi}\|\bar{P}_{{\bf Y}\Pi})\leq I({\bf X};{\bf Y})-I({\bf X};{\bf Y}|\Pi).$

Now by Definition 2, we immediately have

$s_{r}^{*}(X;Y)\geq\frac{I({\bf X};{\bf Y})-I({\bf X};{\bf Y}|\Pi)}{I({\bf XY};\Pi)}$ (21)
$\geq\frac{D(P_{{\bf Y}\Pi}\|\bar{P}_{{\bf Y}\Pi})}{H(\Pi)}$ (22)

which generalizes (11). Therefore, $s_{r}^{*}(X;Y)$ can be used to bound $D(P_{{\bf Y}\Pi}\|\bar{P}_{{\bf Y}\Pi})$, and in turn, the error probability in independence testing.

III Problem Setup and Main Results

We consider estimating the density function at a given point, where the density is assumed to be Hölder continuous in a neighborhood of that point. It is clear that there is no loss of generality in assuming that such a neighborhood is the unit cube and that the given point is its center. More precisely, the class of densities under consideration is defined as follows:

Definition 3.

Given $d\in\{1,2,\dots\}$, $L>0$, $A\in[0,1)$, and $\beta>0$, let $\mathcal{H}(\beta,L,A)$ be the set of all probability densities $p_{XY}$ on $\mathcal{X}\times\mathcal{Y}$ (where $\mathcal{X}=\mathcal{Y}=\mathbb{R}^{d}$) satisfying

$p_{X}(x),p_{Y}(y)\geq A,\quad\forall x,y\in[0,1]^{d},$ (23)

and

$\|p_{XY}\|_{[0,1]^{2d},\beta}\leq L.$ (24)
Definition 4.

We say $\mathcal{C}$ is a prefix code [14] if it is a subset of the set of all finite non-empty binary sequences satisfying the property that for any distinct $s_{1},s_{2}\in\mathcal{C}$, $s_{1}$ cannot be a prefix of $s_{2}$.

The problem is to estimate the density at a given point of an unknown distribution from $\mathcal{H}(\beta,L,A)$. More precisely,

  • $P_{XY}$ is a fixed but unknown distribution whose corresponding density $p_{XY}$ belongs to $\mathcal{H}(\beta,L,A)$ for some $\beta\in(0,\infty)$, $L\in(0,\infty)$, and $A\in[0,1)$.

  • An infinite sequence of pairs $(X(1),Y(1))$, $(X(2),Y(2))$, … is i.i.d. according to $P_{XY}$. Alice (Terminal 1) observes ${\bf X}=(X(l))_{l=1}^{\infty}$ and Bob (Terminal 2) observes ${\bf Y}=(Y(l))_{l=1}^{\infty}$.

  • Unlimited common randomness $\Pi_{0}$ is observed by both Alice and Bob. That is, an infinite random bit string independent of $({\bf X},{\bf Y})$ is shared by Alice and Bob.

  • For $i=1,\dots,r$ ($r$ is an integer), if $i$ is odd, then Alice sends to Bob a message $\Pi_{i}$, which is an element of a prefix code, where $\Pi_{i}$ is computed using the common randomness $\Pi_{0}$, the previous transcripts $\Pi^{i-1}=(\Pi_{1},\dots,\Pi_{i-1})$, and ${\bf X}$; if $i$ is even, then Bob sends to Alice a message $\Pi_{i}$ computed using $\Pi_{0}$, $\Pi^{i-1}$, and ${\bf Y}$.

  • Bob computes an estimate $\hat{p}$ of the true density $p_{XY}(x_{0},y_{0})$, where $x_{0}=y_{0}$ is the center of $[0,1]^{d}$.

One-way NP Estimation Problem. Suppose that $r=1$. Under the constraint on the expected length of the transcript (i.e., the length of the bit string)

$\mathbb{E}[|\Pi^{r}|]\leq k,$ (25)

where $k>0$ is a real number, what is the minimax risk

$R(k):=\min_{\hat{p},\bf\Pi}\max_{p_{XY}\in\mathcal{H}(\beta,L,A)}\mathbb{E}[|\hat{p}-p_{XY}(x_{0},y_{0})|^{2}]\,?$ (26)

Interactive NP Estimation Problem. Under the same constraint on the expected length of the transcript, but without any constraint on the number of rounds $r$, what is the minimax risk?

Remark 1.

The prefix condition ensures that Bob knows that the current round has terminated after finishing reading each $\Pi_{i}$. Alternatively, the problem can be formulated by stating that $\Pi_{i}$ is a random variable in an arbitrary alphabet, and replacing (25) by the entropy constraint $H(\Pi^{r})\leq k$. Furthermore, one may use the information leakage constraint $I({\bf X},{\bf Y};\Pi^{r})\leq k$ instead. From our proofs it is clear that the minimax rates will not change under these alternative formulations.

Remark 2.

There would be no essential difference if the problem were formulated with $|\Pi|\leq k$ almost surely and $|\hat{p}-p_{XY}(x_{0},y_{0})|^{2}\leq R(k)$ with probability (say) at least $1/2$. Indeed, for the upper bound direction, those conditions are satisfied, after a truncation argument, once we have an algorithm satisfying $\mathbb{E}[|\Pi|]\leq k/4$ and $\mathbb{E}[|\hat{p}-p_{XY}(x_{0},y_{0})|^{2}]\leq R(k)/4$, by Markov's inequality and the union bound; therefore the results only differ by a constant factor. For the lower bound, the proof can be extended to the high probability version, since we used a Le Cam style argument [51].

Remark 3.

The common randomness assumption is common in the communication complexity literature, and, in some sense, is equivalent to private randomness [34]. In our upper bound proof, the common randomness is the randomness in the codebooks. Random codebooks give rise to convenient properties, such as the fact that the expectation of the distribution of the matched codewords equals exactly the product of idealized single-letter distributions (82). It is likely, however, that some approximate versions of these proof steps, and ultimately the same asymptotic risk, should hold for some carefully designed deterministic codebooks.

Theorem 1.

In one-way NP estimation, for any $\beta\in(0,\infty)$, $L\in(0,\infty)$, and $A\in[0,1)$,

$R(k)=\Theta\left(\left(\frac{k}{\log k}\right)^{-\frac{2\beta}{d+2\beta}}\right)$ (27)

where $\Theta(\cdot)$ hides multiplicative factors depending on $L$, $\beta$, and $A$.

The proof of the upper bound is in Section VII-B. Recall that nonparametric density estimation using a rectangular kernel is equivalent to counting the frequency of samples in a neighborhood of a given diameter, the bandwidth, which we denote by $\Delta$. A naive protocol is for Alice to send the indices of the samples in $x_{0}+[-\Delta,\Delta]^{d}$. Locating each sample in that neighborhood requires on average $\Theta(\log\frac{1}{\Delta})=\Theta(\log k)$ bits. Thus $\Theta(k/\log k)$ samples in that neighborhood can be located. It turns out that the naive protocol is asymptotically optimal.
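One way to make this bit count concrete (our own illustration; the paper does not prescribe a particular encoding) is to have Alice encode the gaps between consecutive indices falling in the neighborhood with a prefix code such as Elias gamma; since the gaps are of order $\Delta^{-d}$, each located sample costs roughly $2d\log_{2}\frac{1}{2\Delta}+1=\Theta(\log k)$ bits. A minimal sketch with $d=1$ and illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

def elias_gamma_bits(j):
    """Length (in bits) of the Elias gamma codeword of a positive integer j."""
    return 2 * int(np.floor(np.log2(j))) + 1

# Alice marks the indices l with X(l) in the Delta-neighborhood of x0 and encodes the
# gaps between consecutive marked indices; a gap is ~ Delta^{-d} on average, so each
# located sample costs Theta(d * log(1/Delta)) = Theta(log k) bits.
n, delta, x0 = 1_000_000, 0.01, 0.5
x = rng.random(n)                               # X ~ Unif[0,1], d = 1, purely illustrative
hits = np.flatnonzero(np.abs(x - x0) <= delta)  # indices Alice wants to communicate
gaps = np.diff(np.concatenate(([0], hits + 1)))
bits = sum(elias_gamma_bits(g) for g in gaps)
print(bits / len(hits), 2 * np.log2(1 / (2 * delta)) + 1)  # empirical vs rough per-sample cost
```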

The proof of the lower bound (Section VIII) follows by a reduction to testing independence for biased Bernoulli distributions, via a simulation argument. Although some arguments are similar to [19], the present problem concerns biased Bernoulli distributions instead. The (KL) strong data processing constant turns out to be drastically different from the $\chi^{2}$ data processing constant, as opposed to the cases of many familiar distributions such as the unbiased Bernoulli or the Gaussian distributions.

As alluded to, our main result is that the risk can be strictly improved when interactions are allowed:

Theorem 2.

In interactive NP estimation, for any $\beta\in(0,\infty)$, $L\in(0,\infty)$, and $A\in[0,1)$, we have

$R(k)=\Theta\left(k^{-\frac{2\beta}{d+2\beta}}\right)$ (28)

where $\Theta(\cdot)$ hides multiplicative factors depending on $L$, $\beta$, and $A$.

To achieve the scaling in (28), $r$ can grow as slowly as the super-logarithm (i.e., inverse tetration) of $k$; for the precise relation between $r$ and $k$, see Section V-C.

The proof of the upper bound of Theorem 2 is given in Section VII-C, which is based on a novel multi-round estimation scheme for biased Bernoulli distributions formulated and analyzed in Sections IV, V, and VI. Roughly speaking, the intuition is to “locate” the samples within neighborhoods of $(x_{0},y_{0})$ by successive refinements, which is more communication-efficient than revealing the locations all at once.

The lower bound of Theorem 2 is proved in Section IX. The main technical hurdle is to develop new and tighter bounds on the symmetric data processing constant in [32] for the biased binary cases.

IV Estimation of Biased Bernoulli Distributions

In this section, we shall describe an algorithm for estimating the joint distribution of a pair of biased Bernoulli random variables. The biased Bernoulli estimation problem can be viewed as a natural generalization of the Gap-Hamming problem [24][12] to the sparse setting, and it is the key component in both the upper and lower bound analysis for the nonparametric estimation problem. Indeed, we shall explain in Section VII that our nonparametric estimator is based on a linear combination of rectangular kernel estimators, which estimate the probability that $X$ and $Y$ fall into neighborhoods of $x_{0}$ and $y_{0}$. Indicators that samples are within such neighborhoods are Bernoulli variables, so the biased Bernoulli estimator can be used. For the lower bound, we shall explain in Section VIII that the nonparametric estimation problem can be reduced to the biased Bernoulli estimation problem via a simulation argument.

For notational simplicity, we shall use $X,Y$ for the Bernoulli variables in this section as well as in Sections V-VI, although we should keep in mind that these are not the continuous variables in the original nonparametric estimation problem.

Bernoulli Estimation Problem:

  • Fixed real numbers $m_{1},m_{2}\in(10,\infty)$, and an unknown $\delta\in[-1,\min\{m_{1},m_{2}\}-1]$.

  • $({\bf X},{\bf Y})=(X(l),Y(l))_{l=1}^{\infty}$ i.i.d. according to the distribution

    $P^{(\delta)}_{XY}:=\left(\begin{array}{cc}\frac{1}{m_{1}m_{2}}(1+\delta)&\frac{1}{m_{1}}(1-\frac{1}{m_{2}})-\frac{\delta}{m_{1}m_{2}}\\ \frac{1}{m_{2}}(1-\frac{1}{m_{1}})-\frac{\delta}{m_{1}m_{2}}&(1-\frac{1}{m_{1}})(1-\frac{1}{m_{2}})+\frac{\delta}{m_{1}m_{2}}\end{array}\right)$ (31)

    where we recall our convention that the upper left entry of the matrix denotes the probability that $X=Y=0$. Alice observes $(X(l))_{l=1}^{\infty}$ and Bob observes $(Y(l))_{l=1}^{\infty}$.

  • Unlimited common randomness $\Pi_{0}$.

Goal: Alice and Bob exchange messages in no more than $r$ rounds in order to estimate $\delta$.
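For concreteness, here is a short sketch (ours) that builds the joint pmf (31) and draws i.i.d. pairs from it; the values of $m_{1},m_{2},\delta$ and the sample size are arbitrary illustrative choices.

```python
import numpy as np

def bernoulli_joint(delta, m1, m2):
    """Joint pmf (31) of (X, Y); rows indexed by x, columns by y, upper-left = P(X=0, Y=0)."""
    return np.array([
        [(1 + delta) / (m1 * m2),             (1 - 1/m2) / m1 - delta / (m1 * m2)],
        [(1 - 1/m1) / m2 - delta / (m1 * m2), (1 - 1/m1) * (1 - 1/m2) + delta / (m1 * m2)],
    ])

def sample_pairs(n, delta, m1, m2, rng):
    p = bernoulli_joint(delta, m1, m2).ravel()   # order: (0,0), (0,1), (1,0), (1,1)
    idx = rng.choice(4, size=n, p=p)
    return idx // 2, idx % 2                     # X(l), Y(l)

rng = np.random.default_rng(0)
m1 = m2 = 100.0
X, Y = sample_pairs(1_000_000, 0.5, m1, m2, rng)
print(np.mean((X == 0) & (Y == 0)) * m1 * m2 - 1)  # centralized estimate of delta, ~ 0.5
```

Note that $P(X=0,Y=0)=\frac{1+\delta}{m_{1}m_{2}}$, so with all samples in one place $\delta$ is readily estimable from the frequency of common zeros; the protocol below has to accomplish this under a communication budget.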

Our algorithm is described as follows:

Input. $m_{1},m_{2}\in(10,\infty)$; positive integers $n$ and $r$; a sequence of real numbers $\alpha_{1},\dots,\alpha_{r}\in(1,\infty)$ satisfying

$\prod_{1\leq i\leq r}^{\rm odd}\alpha_{i}\leq\frac{m_{1}}{10};$ (32)
$\prod_{1\leq i\leq r}^{\rm even}\alpha_{i}\leq\frac{m_{2}}{10}.$ (33)

The $\alpha_{1},\dots,\alpha_{r}$ can be viewed as parameters of the algorithm, and they control how much information is revealed about the locations of “common zeros” of $({\bf X},{\bf Y})$ in each round of communication. For example, setting $\alpha_{1}=\frac{m_{1}}{10}$ and all other $\alpha_{i}=1$ yields a one-way communication protocol, whereas setting all $\alpha_{i}>1$ yields a “successive refinement” algorithm which may incur a smaller communication budget yet convey the same amount of information.

Before describing the algorithm, let us define a conditional distribution $P_{U^{r}|XY}$ by recursion, which will be used later in generating random codebooks.

Definition 5.

For each $i\in\{1,\dots,r\}\setminus 2\mathbb{Z}$, define

$P_{U_{i}|X=0,U^{i-1}={\bf 0}}=[1,0];$ (34)
$P_{U_{i}|X=1,U^{i-1}={\bf 0}}=[\alpha_{i}^{-1},1-\alpha_{i}^{-1}];$ (35)
$P_{U_{i}|X=0,U^{i-1}\neq{\bf 0}}=P_{U_{i}|X=1,U^{i-1}\neq{\bf 0}}=[0,1].$ (36)

Then set $P_{U_{i}|XYU^{i-1}}=P_{U_{i}|XU^{i-1}}$. For $i=1,\dots,r$ even, we use similar definitions, but with the roles of $X$ and $Y$ switched. This specifies $P_{U_{i}|XYU^{i-1}}$, $i=1,\dots,r$.

Note that by Definition 5, $U_{i}=1$ implies $U_{i+1}=1$ for each $i=1,\dots,r-1$. In words, for $i$ odd, $U_{i}$ marks all $X=0$ as 0, and marks $X=1$ as either 0 or 1; whenever $U_{i}=1$ is marked, then $X$ is definitely 1, and will be forgotten in all subsequent rounds. Now set

$P_{XYU^{r}}^{(\delta)}:=P_{U^{r}|XY}P^{(\delta)}_{XY},$ (37)

where $P_{U^{r}|XY}$ is induced by $(P_{U_{i}|XYU^{i-1}})_{i=1}^{r}$ in Definition 5.

Initialization. By applying a common function to the common randomness, Alice and Bob can produce a shared infinite array $(V_{i,j}(l))$, where $i\in\{1,\dots,r\}$, $j\in\{1,2,\dots\}$, $l\in\{1,2,\dots,n\}$, such that the entries in the array are independent random variables, with $V_{i,j}(l)\sim{\rm Bern}(1-\alpha_{i}^{-1})$. Also set

$U_{0}(l)=0,\quad\forall l=1,\dots,n.$ (38)

Iterations. Consider any $i=1,\dots,r$, where $i$ is odd. We want to generate ${\bf U}_{i}$ by selecting a codeword so that $({\bf X},{\bf Y},{\bf U}^{i})$ follows the distribution $(P_{XYU^{i}}^{(\delta)})^{\otimes n}$, where ${\bf U}^{i-1}$ is defined in previous rounds. Define

$\mathcal{A}_{0}:=\{l\leq n\colon X(l)=0,U_{i-1}(l)=0\};$ (39)
$\mathcal{A}_{1}:=\{l\leq n\colon X(l)=1,U_{i-1}(l)=0\};$ (40)
$\mathcal{A}:=\{l\leq n\colon U_{i-1}(l)=0\}.$ (41)

Note that Alice knows both $\mathcal{A}_{0}$ and $\mathcal{A}_{1}$, while Bob knows $\mathcal{A}$, since it will be seen from the recursion that Alice and Bob both know ${\bf U}_{1},\dots,{\bf U}_{i-1}$ at the beginning of the $i$-th round. Alice chooses $\hat{j}_{i}$ as the minimum positive integer $j$ such that

$V_{i,j}(l)=0,\quad\forall l\in\mathcal{A}_{0}.$ (42)

Alice encodes $\hat{j}_{i}$ using a prefix code, e.g., the Elias gamma code [17], and sends it to Bob. Then both Alice and Bob compute ${\bf U}_{i}=(U_{i}(l))_{l=1}^{n}\in\{0,1\}^{n}$ by

$U_{i}(l):=V_{i,\hat{j}_{i}}(l),\quad\forall l\in\mathcal{A};$ (43)
$U_{i}(l):=1,\quad\forall l\in\{1,\dots,n\}\setminus\mathcal{A}.$ (44)

The operations in the $i$-th round for even $i$ are similar, with the roles of Alice and Bob reversed. We will see later that the notation ${\bf U}_{i}$ is consistent in the sense of (53).
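The following sketch (ours; helper names and parameter values are illustrative) implements one odd-indexed round from Alice's side: scanning codewords $V_{i,j}\sim{\rm Bern}(1-\alpha_{i}^{-1})^{\otimes n}$ until the selection rule (42) is met, Elias gamma encoding $\hat{j}_{i}$, and forming ${\bf U}_{i}$ via (43)-(44). In the actual protocol the $V_{i,j}$ come from the common randomness $\Pi_{0}$, so Bob recovers $V_{i,\hat{j}_{i}}$, and hence ${\bf U}_{i}$, from $\hat{j}_{i}$ alone; here we simply draw them on the fly, which has the same distribution. The small parameters keep the scan short, cf. (86).

```python
import numpy as np

def elias_gamma(j):
    """Elias gamma codeword (as a bit string) of a positive integer j: 2*floor(log2 j) + 1 bits."""
    b = bin(j)[2:]
    return "0" * (len(b) - 1) + b

def odd_round_alice(X, U_prev, alpha_i, rng):
    """One odd round i, Alice's side: pick j_hat by rule (42) and form U_i by (43)-(44)."""
    n = len(X)
    A = np.flatnonzero(U_prev == 0)                     # known to both parties
    A0 = np.flatnonzero((X == 0) & (U_prev == 0))       # known to Alice only
    j_hat = 0
    while True:                                         # scan the shared codewords V_{i,1}, V_{i,2}, ...
        j_hat += 1
        V = rng.binomial(1, 1 - 1 / alpha_i, size=n)    # V_{i,j}(l) ~ Bern(1 - alpha_i^{-1})
        if not V[A0].any():                             # rule (42): V_{i,j}(l) = 0 for all l in A_0
            break
    U_i = np.ones(n, dtype=int)                         # (44): U_i(l) = 1 off A
    U_i[A] = V[A]                                       # (43): U_i(l) = V_{i,j_hat}(l) on A
    return U_i, elias_gamma(j_hat)

rng = np.random.default_rng(0)
n, m1, alpha1, trials = 100, 50.0, 3.0, 200             # small values so the scan stays short
lengths = []
for _ in range(trials):
    X = rng.binomial(1, 1 - 1 / m1, size=n)             # P(X(l) = 0) = 1/m1
    _, msg = odd_round_alice(X, np.zeros(n, dtype=int), alpha1, rng)
    lengths.append(len(msg))
print(np.mean(lengths), 2 * (n / m1) * np.log2(alpha1) + 1)  # average cost vs the bound (89) at delta = 0
```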

Estimator. Recall that in classical parametric statistics, one can evaluate the score function at the sample, compute its expectation and variance, and construct an estimator achieving the Cramér-Rao bound asymptotically. Now for $i\in\{1,\dots,r\}\setminus 2\mathbb{Z}$, define the score function

$\Gamma_{i}(u^{i},y):=\frac{\partial}{\partial\delta}\left.\ln P^{(\delta)}_{U_{i}|YU^{i-1}}(u_{i}|y,u^{i-1})\right|_{\delta=0}$ (45)
$=\begin{cases}\frac{\partial}{\partial\delta}\left.\ln P^{(\delta)}_{U_{i}|YU^{i-1}}(u_{i}|y,{\bf 0})\right|_{\delta=0}&\textrm{if }u^{i-1}={\bf 0}\\ 0&\textrm{otherwise}\end{cases}$ (48)

where $P^{(\delta)}_{U_{i}|YU^{i-1}}$ is induced by $P_{XYU^{r}}^{(\delta)}$. For $i\in\{1,\dots,r\}\cap 2\mathbb{Z}$, define $\Gamma_{i}(u^{i},x)$ similarly with the roles of $X$ and $Y$ reversed. Alice and Bob can each compute

$\Gamma^{\rm A}:=\sum_{1\leq i\leq r}^{\rm even}\sum_{l=1}^{n}\Gamma_{i}(U^{i}(l),X(l))$ (49)

and

$\Gamma^{\rm B}:=\sum_{1\leq i\leq r}^{\rm odd}\sum_{l=1}^{n}\Gamma_{i}(U^{i}(l),Y(l))$ (50)

respectively. Finally, Alice’s and Bob’s estimators are given by

$\hat{\delta}^{\rm A}:=\Gamma^{\rm A}\cdot\left(\partial_{\delta}\mathbb{E}^{(\delta)}[\Gamma^{\rm A}]\right)^{-1};$ (51)
$\hat{\delta}^{\rm B}:=\Gamma^{\rm B}\cdot\left(\partial_{\delta}\mathbb{E}^{(\delta)}[\Gamma^{\rm B}]\right)^{-1},$ (52)

where $\mathbb{E}^{(\delta)}$ refers to the expectation when the true parameter is $\delta$, and $\partial_{\delta}$ denotes the derivative in $\delta$. We will show that these estimators are well-defined: $\partial_{\delta}\mathbb{E}^{(\delta)}[\Gamma^{\rm A}]$ and $\partial_{\delta}\mathbb{E}^{(\delta)}[\Gamma^{\rm B}]$ are independent of $\delta$ (Lemma 3), and can be computed by Alice and Bob without knowing $\delta$.
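As an illustration of (45) and (51)-(52) in the simplest case $r=1$ (where $\Gamma^{\rm B}=\sum_{l}\Gamma_{1}(U_{1}(l),Y(l))$ and, by (99)-(100) in Section VI-B, $\partial_{\delta}\mathbb{E}^{(\delta)}[\Gamma^{\rm B}]=I^{\rm B}$), the following sketch (ours) computes the score table $\Gamma_{1}(u_{1},y)$ and the per-sample normalizer in closed form; the parameter values are arbitrary.

```python
import numpy as np

def joint_pmf(delta, m1, m2):
    """The 2x2 pmf (31); rows are x, columns are y."""
    return np.array([
        [(1 + delta) / (m1 * m2),             (1 - 1/m2) / m1 - delta / (m1 * m2)],
        [(1 - 1/m1) / m2 - delta / (m1 * m2), (1 - 1/m1) * (1 - 1/m2) + delta / (m1 * m2)],
    ])

def score_and_normalizer(m1, m2, alpha1):
    """Score table Gamma_1(u1, y) of (45) and the per-sample normalizer I^B / n, for r = 1."""
    # P_{U1|X} from Definition 5 with U^0 = 0: U1 = 0 if X = 0; U1 ~ Bern(1 - 1/alpha1) if X = 1.
    p_u_given_x = np.array([[1.0, 0.0], [1 / alpha1, 1 - 1 / alpha1]])          # rows x, cols u1
    p_uy0 = p_u_given_x.T @ joint_pmf(0.0, m1, m2)                              # P^{(0)}_{U1 Y}
    dp_uy = p_u_given_x.T @ (np.array([[1.0, -1.0], [-1.0, 1.0]]) / (m1 * m2))  # d/d delta of P_{U1 Y}
    gamma = dp_uy / p_uy0                      # Gamma_1(u1, y); P_Y does not depend on delta
    fisher = float(np.sum(gamma**2 * p_uy0))   # = I^B / n = (1/n) d/d delta E^{(delta)}[Gamma^B]
    return gamma, fisher

gamma, fisher = score_and_normalizer(m1=100.0, m2=100.0, alpha1=10.0)
print(gamma)
print(1 / fisher, 25 * 100.0 * 100.0)  # per-sample variance proxy vs the constant in Corollary 7 ahead
```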

Lemma 2.

For each $i\in\{1,\dots,r\}\setminus 2\mathbb{Z}$, conditioned on ${\bf X},{\bf Y},{\bf U}_{1},\dots,{\bf U}_{i-1}$, we have

${\bf U}_{i}\sim P^{\otimes n}_{U_{i}|XU^{i-1}}(\cdot|{\bf X},{\bf U}_{1},\dots,{\bf U}_{i-1}),$ (53)

where $P_{U_{i}|XU^{i-1}}$ is as defined in (34)-(36). A similar relation holds for even $i$.

Proof.

Immediate from (43)-(44). ∎

Lemma 3.

$\mathbb{E}^{(\delta)}[\Gamma^{\rm A}]$ and $\mathbb{E}^{(\delta)}[\Gamma^{\rm B}]$ are linear in $\delta$.

Proof.

By (53),

$\mathbb{E}^{(\delta)}[\Gamma_{i}(U^{i}(l),Y(l))]$ (54)
$=\sum_{u^{r},x,y}\Gamma_{i}(u^{i},y)P^{(\delta)}_{XY}(x,y)P_{U^{r}|XY}(u^{r}|x,y)$ (55)

for each $i$ odd and $l\in\{1,\dots,n\}$, and similar expressions hold for $i$ even. The claims then follow. ∎

Theorem 3.

$\hat{\delta}^{\rm A}$ and $\hat{\delta}^{\rm B}$ are unbiased estimators.

Proof.

For $i$ odd, by (45) we have $\sum_{u_{i}}\Gamma_{i}(u^{i},y)P^{(0)}_{U_{i}|YU^{i-1}}(u_{i}|y,u^{i-1})=0$ for any $(y,u^{i-1})$. Then $\mathbb{E}^{(0)}[\Gamma_{i}(U^{i}(l),Y(l))]=0$ follows from (55). It follows that $\mathbb{E}^{(0)}[\hat{\delta}^{\rm A}]=\mathbb{E}^{(0)}[\hat{\delta}^{\rm B}]=0$, and unbiasedness is implied by Lemma 3. ∎

V Bounds on Information Exchanges

In this section we prove key auxiliary results that will be used in the upper bounds.

V-A General $(\alpha_{i})$

The following theorem is crucial for the achievability part of the analysis of the Bernoulli estimation problem described in Section IV (and hence for the nonparametric estimation problem). Specifically, (56)-(57) bound the communication from Alice to Bob and from Bob to Alice, and (58)-(59) bound the information exchanged, which, in turn, will bound the estimation risk via Fano's inequality.

Theorem 4.

Consider any $m_{1},m_{2}>10$, $\alpha_{1},\dots,\alpha_{r}\in(1,\infty)$ satisfying (32)-(33), and $P_{U^{r}XY}^{(\delta)}$ as in (37). We have

$\sum_{1\leq i\leq r}^{\rm odd}P^{(0)}_{XU^{i-1}}(0,{\bf 0})\log\alpha_{i}\leq\frac{1.1}{m_{1}}\sum_{1\leq i\leq r}^{\rm odd}\log\alpha_{i}\prod_{2\leq j\leq i-1}^{\rm even}\alpha_{j}^{-1};$ (56)
$\sum_{1\leq i\leq r}^{\rm even}P^{(0)}_{YU^{i-1}}(0,{\bf 0})\log\alpha_{i}\leq\frac{1.1}{m_{2}}\sum_{1\leq i\leq r}^{\rm even}\log\alpha_{i}\prod_{1\leq j\leq i-1}^{\rm odd}\alpha_{j}^{-1},$ (57)

and assuming the natural base of logarithms,

$\lim_{\delta\to 0}\delta^{-2}\sum_{1\leq i\leq r}^{\rm odd}I(U_{i};Y|U^{i-1})\geq\frac{1}{5m_{1}^{2}m_{2}}\prod_{1\leq j\leq r}^{\rm odd}\alpha_{j};$ (58)
$\lim_{\delta\to 0}\delta^{-2}\sum_{1\leq i\leq r}^{\rm even}I(U_{i};X|U^{i-1})\geq\frac{1}{5m_{1}m_{2}^{2}}\prod_{1\leq j\leq r}^{\rm even}\alpha_{j}.$ (59)

The proof can be found in Appendix A.

Remark 4.

Since

$P^{(0)}_{XU^{i-1}}(0,{\bf 0})\log\alpha_{i}$
$=P^{(0)}_{XU^{i-1}}(0,{\bf 0})D(P_{U_{i}|X=0,U^{i-1}={\bf 0}}\|P_{U_{i}|X=1,U^{i-1}={\bf 0}})$ (60)
$\geq P^{(0)}_{U^{i-1}}({\bf 0})\inf_{Q}\left[P^{(0)}_{X|U^{i-1}}(0|{\bf 0})D(P_{U_{i}|X=0,U^{i-1}={\bf 0}}\|Q)+P^{(0)}_{X|U^{i-1}}(1|{\bf 0})D(P_{U_{i}|X=1,U^{i-1}={\bf 0}}\|Q)\right]$ (61)
$=P^{(0)}_{U^{i-1}}({\bf 0})I(U_{i};X|U^{i-1}={\bf 0})$ (62)
$=I(U_{i};X|U^{i-1}),$ (63)

Theorem 4 also implies the following bound on the external information (see (18)):

$I(U^{r};XY)\leq\frac{1.1}{m_{1}}\sum_{1\leq i\leq r}^{\rm odd}\log\alpha_{i}\prod_{2\leq j\leq i-1}^{\rm even}\alpha_{j}^{-1}+\frac{1.1}{m_{2}}\sum_{1\leq i\leq r}^{\rm even}\log\alpha_{i}\prod_{1\leq j\leq i-1}^{\rm odd}\alpha_{j}^{-1}.$ (64)
Remark 5.

Let us provide some intuition for why interaction helps, assuming the case $m_{1}=m_{2}=m$ for simplicity. From the proof of Theorem 4, it can be seen that, up to a constant factor, $s_{\infty}^{*}(X;Y)$ is at least $\frac{\delta^{2}}{m^{3}}\int_{0}^{\ln\frac{m}{100}}e^{t}{\rm d}t\left(\frac{1}{m}\int_{0}^{\ln\frac{m}{100}}e^{-t}{\rm d}t\right)^{-1}\sim\frac{\delta^{2}}{m}$ (which is in fact sharp, as will be seen from the upper bound on $s_{\infty}^{*}(X;Y)$ in Theorem 6). Moreover, lower bounds on $s_{r}^{*}(X;Y)$ can be computed by replacing the integrals with discrete sums with $r$ terms:

$\frac{\delta^{2}}{m^{3}}\sum_{i=1}^{\lceil r/2\rceil}(e^{t_{i}}-e^{t_{i-1}})\left(\frac{1}{m}\sum_{i=1}^{\lceil r/2\rceil}e^{-t_{i-1}}(t_{i}-t_{i-1})\right)^{-1}$ (65)

where $1=t_{0}<t_{1}<\dots<t_{\lceil r/2\rceil}=\ln\frac{m}{100}$. In particular, when $r=1$, we recover $s_{1}^{*}(X;Y)\sim\frac{\delta^{2}}{m\ln m}$, whereas choosing $t_{i}-t_{i-1}=1$, $i=1,\dots,\lceil r/2\rceil$, shows that $r\sim\ln m$ achieves $s_{r}^{*}(X;Y)\sim\frac{\delta^{2}}{m}$. Even better, later we will take $t_{k}$ to be the $k$-th iterated power of 2, and then $r$ will be the super-logarithm of $m$.
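A small numeric sketch (ours) of (65) for $m_{1}=m_{2}=m$, comparing the one-step partition ($r=1$) with a unit-step partition ($r\sim\ln m$ rounds): after multiplying by $m$, the former decays like $1/\ln m$ while the latter stays of constant order, matching the two scalings above.

```python
import math

def lower_bound_65(m, ts, delta=1.0):
    """Evaluate (65) for a partition ts = [t_0, ..., t_{ceil(r/2)}] of [t_0, ln(m/100)]."""
    num = sum(math.exp(b) - math.exp(a) for a, b in zip(ts, ts[1:]))
    den = sum(math.exp(-a) * (b - a) for a, b in zip(ts, ts[1:])) / m
    return (delta ** 2 / m ** 3) * num / den

for m in [1e4, 1e6, 1e8]:
    T = math.log(m / 100)
    one_way = lower_bound_65(m, [1.0, T])                                # r = 1
    many    = lower_bound_65(m, [1.0 + i for i in range(int(T))] + [T])  # r ~ 2 ln m rounds
    print(m, m * one_way, m * many)  # m * one_way shrinks like 1/ln m; m * many stays ~ constant
```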

Recall that the $(\alpha_{i})$ control the amount of information revealed in each round and serve as hyperparameters of the algorithm to be tuned. Next we shall explain how to select the values of the $(\alpha_{i})$ so that the optimal performance is achieved in the one-way and interactive settings.

V-B $r=1$ Case

Specializing Theorem 4 we obtain:

Corollary 4.

For any $m_{1},m_{2}>10$, with $r=1$ and $\alpha_{1}=\frac{m_{1}}{10}$ we have

$P^{(0)}_{X}(0)\log\alpha_{1}\leq\frac{1.1}{m_{1}}\log\frac{m_{1}}{10};$ (66)
$\lim_{\delta\to 0}\delta^{-2}I(U_{1};Y)\geq\frac{1}{50m_{1}m_{2}}.$ (67)

V-C $r=\infty$ Case

Denote by ${}^{n}2$ the $n$-th tetration of 2, which is defined recursively by ${}^{0}2=1$ and

${}^{n}2:=2^{\left({}^{(n-1)}2\right)},\quad\forall n\geq 1.$ (68)

Let $m:=\min\{m_{1},m_{2}\}$, and let $r_{0}$ be the minimum integer such that

$\exp_{e}({}^{r_{0}}2-1)\geq\frac{m}{10}.$ (69)

For $m>10$ we have $r_{0}\geq 1$. Then we set

$r:=2r_{0};$ (70)
$\alpha_{2k-1}:=\alpha_{2k}:=\exp_{e}({}^{k}2-{}^{(k-1)}2),\quad\forall k\in\{1,\dots,r_{0}-1\};$ (71)
$\alpha_{2r_{0}-1}:=\alpha_{2r_{0}}:=\frac{m}{10}\exp_{e}(1-{}^{(r_{0}-1)}2),$ (72)

which fulfills $\alpha_{i}>1$. We see that

$\sum_{1\leq i\leq r}^{\rm odd}\ln\alpha_{i}\prod_{2\leq j\leq i-1}^{\rm even}\alpha_{j}^{-1}$
$\leq\sum_{k=1}^{r_{0}}\left({}^{k}2-{}^{(k-1)}2\right)\exp_{e}\left(1-{}^{(k-1)}2\right)$ (73)
$\leq e\sum_{k=1}^{\infty}{}^{k}2\exp_{e}\left(-{}^{(k-1)}2\right)$ (74)
$=e\sum_{k=1}^{\infty}\exp_{e}\left(-(1-\log 2)\cdot{}^{(k-1)}2\right)$ (75)
$<5.$ (76)

The first inequality above follows from $\alpha_{r-1}=\frac{m}{10}\exp_{e}(1-{}^{(r_{0}-1)}2)\leq\exp_{e}({}^{r_{0}}2-{}^{(r_{0}-1)}2)$. Similarly we also have $\sum_{1\leq i\leq r}^{\rm even}\ln\alpha_{i}\prod_{1\leq j\leq i-1}^{\rm odd}\alpha_{j}^{-1}<5$. Moreover,

$\prod_{1\leq j\leq r}^{\rm odd}\alpha_{j}=\prod_{1\leq j\leq r}^{\rm even}\alpha_{j}=\frac{m}{10}.$ (77)

Summarizing, we have

Corollary 5.

Consider $m_{1},m_{2}>10$, $m:=\min\{m_{1},m_{2}\}$, and $(\alpha_{i})$ defined in (70)-(72). We have

$\sum_{1\leq i\leq r}^{\rm odd}P^{(0)}_{XU^{i-1}}(0,{\bf 0})\ln\alpha_{i}\leq\frac{6}{m_{1}};$ (78)
$\sum_{1\leq i\leq r}^{\rm even}P^{(0)}_{YU^{i-1}}(0,{\bf 0})\ln\alpha_{i}\leq\frac{6}{m_{2}};$ (79)
$\lim_{\delta\to 0}\delta^{-2}\sum_{1\leq i\leq r}^{\rm odd}I(U_{i};Y|U^{i-1})\geq\frac{m}{50m_{1}^{2}m_{2}};$ (80)
$\lim_{\delta\to 0}\delta^{-2}\sum_{1\leq i\leq r}^{\rm even}I(U_{i};X|U^{i-1})\geq\frac{m}{50m_{1}m_{2}^{2}},$ (81)

where $r=2r_{0}$ and $r_{0}$ is defined in (69).

Let us remark that the sequence $(\alpha_{i})$ we used in (70)-(72) is essentially optimal: let $\beta_{k}:=\prod_{2\leq j\leq 2k}^{\rm even}\alpha_{j}^{-1}$. In order for (73) to converge, we need $\sum_{k}\ln(\frac{\beta_{k}}{\beta_{k-1}})\beta_{k-1}^{-1}$ to be convergent. Therefore $\beta_{k}$ cannot grow faster than $\beta_{k}=\exp(\beta_{k-1})$, which is tetration. However, this only amounts to a lower bound on $r$ for the particular design of $P_{U^{r}|XY}$ in Definition 5. Since tetration grows super fast, from a practical viewpoint $r$ is essentially a constant. Nevertheless, it remains an interesting theoretical question whether $r$ needs to diverge:
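A short sketch (ours) that computes $r_{0}$ from (69) and the schedule (70)-(72) for $m_{1}=m_{2}=m$, and numerically checks (76)-(77) for an illustrative $m$:

```python
import math

def tetration2(n):
    """n-th tetration of 2: {}^0 2 = 1, {}^n 2 = 2 ** ({}^{n-1} 2)."""
    t = 1
    for _ in range(n):
        t = 2 ** t
    return t

def alpha_schedule(m):
    """r = 2*r0 and (alpha_1, ..., alpha_r) from (69)-(72), assuming m1 = m2 = m > 10."""
    r0 = 1
    while tetration2(r0) - 1 < math.log(m / 10):           # (69), compared in the log domain
        r0 += 1
    alphas = []
    for k in range(1, r0):
        a = math.exp(tetration2(k) - tetration2(k - 1))    # (71)
        alphas += [a, a]
    a_last = (m / 10) * math.exp(1 - tetration2(r0 - 1))   # (72)
    alphas += [a_last, a_last]
    return 2 * r0, alphas

m = 1e12
r, alphas = alpha_schedule(m)
odd = alphas[0::2]                                         # alpha_1, alpha_3, ...
print(r)                                                   # r = 2*r0 rounds (here 8)
print(math.prod(odd), m / 10)                              # (77): the product over odd i equals m/10
print(sum(math.log(a) / math.prod(odd[:k]) for k, a in enumerate(odd)))  # the sum in (73)-(76), < 5
```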

Conjecture 1.

If there is an algorithm (indexed by $k$) achieving the optimal risk (28) for nonparametric estimation, then necessarily $r\to\infty$ as $k\to\infty$.

VI Achievability Bounds for Bernoulli Estimation

In this section we analyze the performance of the Bernoulli distribution estimation algorithm described in Section IV.

VI-A Communication Complexity

Consider any $i\in\{1,\dots,r\}$. Denoting by $\widehat{P}_{{\bf XY}{\bf U}^{i}}$ the empirical distribution of $(X(l),Y(l),U_{1}(l),\dots,U_{i}(l))_{l=1}^{n}$, we have from (53) that

$\mathbb{E}^{(\delta)}[\widehat{P}_{{\bf XY}{\bf U}^{i}}|{\bf X},{\bf Y},{\bf U}^{i-1}]=\widehat{P}_{{\bf XY}{\bf U}^{i-1}}P_{U_{i}|XU^{i-1}}.$ (82)

In particular,

$\mathbb{E}^{(\delta)}[\widehat{P}_{{\bf XY}{\bf U}^{r}}]=P_{XY}^{(\delta)}P_{U^{r}|XY}.$ (83)

Let $\ell(\hat{j}_{i}):=2\lfloor\log_{2}(\hat{j}_{i})\rfloor+1$ be the number of bits needed to encode the positive integer $\hat{j}_{i}$ using the Elias gamma code [17]. For each $i\in\{1,\dots,r\}\setminus 2\mathbb{Z}$ we have

$\mathbb{E}^{(\delta)}[\ell(\hat{j}_{i})|{\bf X},{\bf Y},{\bf U}^{i-1}]\leq 2\mathbb{E}^{(\delta)}[\log_{2}\hat{j}_{i}|{\bf X},{\bf Y},{\bf U}^{i-1}]+1$ (84)
$\leq 2\log_{2}\mathbb{E}^{(\delta)}[\hat{j}_{i}|{\bf X},{\bf Y},{\bf U}^{i-1}]+1$ (85)
$=2\log_{2}\alpha_{i}^{n\widehat{P}_{{\bf X}{\bf U}_{i-1}}(0,{\bf 0})}+1$ (86)
$=2n\widehat{P}_{{\bf X}{\bf U}_{i-1}}(0,{\bf 0})\log_{2}\alpha_{i}+1,$ (87)

where (86) follows from the selection rule (42) and the formula for the expectation of the geometric distribution. Then

$\mathbb{E}^{(\delta)}[\ell(\hat{j}_{i})]\leq 2nP_{XU_{i-1}}^{(\delta)}(0,{\bf 0})\log_{2}\alpha_{i}+1$ (88)
$\leq 2(1+\delta)nP_{XU_{i-1}}^{(0)}(0,{\bf 0})\log_{2}\alpha_{i}+1,$ (89)

where (88) used (83), and (89) used the fact that $P_{XYU^{r}}^{(\delta)}$ is dominated by $(1+\delta)P_{XYU^{r}}^{(0)}$. Note that ${\bf 0}$ in (88)-(89) denotes the value of the vector $U^{i-1}$.

VI-B Expectation of $\Gamma^{\rm B}$

Recall that $\Gamma^{\rm B}$ was defined in (50). Pick an arbitrary $i\in\{1,\dots,r\}\setminus 2\mathbb{Z}$. Since

$P_{U^{i}Y}^{(\delta)}:=P_{Y}\prod_{j=1}^{i}P^{(\delta)}_{U_{j}|U^{j-1}Y}$ (90)

and since $P_{Y}$ and $(P^{(\delta)}_{U_{j}|U^{j-1}Y})_{j\in\{1,\dots,r\}\cap 2\mathbb{Z}}$ are independent of $\delta$, we obtain

$\partial_{\delta}\ln P_{U^{i}Y}^{(\delta)}(u^{i},y)|_{\delta=0}=\sum_{1\leq j\leq i}^{\rm odd}\Gamma_{j}(u^{j},y).$ (91)

Next, observe that for any $l\in\{1,\dots,n\}$,

$\mathbb{E}^{(0)}[\Gamma_{i}(U^{i}(l),Y(l))|U^{i-1}(l),Y(l)]$
$=\mathbb{E}^{(0)}\left[\left.\frac{\left.\partial_{\delta}P^{(\delta)}_{U_{i}|YU^{i-1}}(U_{i}(l)|Y,U^{i-1}(l))\right|_{\delta=0}}{P^{(0)}_{U_{i}|YU^{i-1}}(U_{i}(l)|Y,U^{i-1}(l))}\right|U^{i-1}(l),Y(l)\right]$ (92)
$=\sum_{u_{i}}\left.\partial_{\delta}P^{(\delta)}_{U_{i}|YU^{i-1}}(u_{i}|Y,U^{i-1}(l))\right|_{\delta=0}$ (93)
$=0.$ (94)

Moreover, for any $\delta\neq 0$,

$\frac{1}{\delta}\sum_{l=1}^{n}\mathbb{E}^{(\delta)}[\Gamma_{i}(U_{1}(l),\dots,U_{i}(l),Y(l))]$
$=\delta^{-1}n\sum_{u^{i},y}\Gamma_{i}(u^{i},y)P^{(\delta)}_{U^{i}Y}(u^{i},y)$ (95)
$=n\sum_{u^{i},y}\Gamma_{i}(u^{i},y)\frac{\partial}{\partial\delta}P_{U^{i}Y}^{(\delta)}(u^{i},y)|_{\delta=0}$ (96)
$=n\sum_{u^{i},y}\Gamma_{i}(u^{i},y)P_{U^{i}Y}^{(0)}(u^{i},y)\sum_{1\leq j\leq i}^{\rm odd}\Gamma_{j}(u^{j},y)$ (97)
$=n\sum_{u^{i},y}\Gamma_{i}^{2}(u^{i},y)P_{U^{i}Y}^{(0)}(u^{i},y),$ (98)

where (95) used (83); (96) used (94) and the linearity of $P^{(\delta)}_{U^{i-1}Y}$ in $\delta$; (97) used (91); (98) follows from (94). Thus

$\frac{1}{\delta}\mathbb{E}^{(\delta)}[\Gamma^{\rm B}]=I^{\rm B},\quad\forall\delta\neq 0,$ (99)

where we defined

$I^{\rm B}:=n\sum_{1\leq i\leq r}^{\rm odd}\mathbb{E}^{(0)}[\Gamma_{i}^{2}(U^{i}(1),Y(1))].$ (100)
Lemma 6.

Let $(U^{i},Y)\sim P^{(\delta)}_{U^{i}Y}$. We have

$I^{\rm B}\geq 2n\lim_{\delta\to 0}\delta^{-2}\sum_{1\leq i\leq r}^{\rm odd}I(U_{i};Y|U^{i-1}),$ (101)

where the logarithmic base of the mutual information is natural.

Proof.

Consider any $i\in\{1,2,\dots,r\}\setminus 2\mathbb{Z}$. We have

$\mathbb{E}^{(0)}\left[\Gamma_{i}^{2}(U^{i},Y)\right]$
$=\mathbb{E}^{(0)}\left[\left(\partial_{\delta}\ln P^{(\delta)}_{U_{i}|YU^{i-1}}(U_{i}|YU^{i-1})|_{\delta=0}\right)^{2}\right]$ (102)
$=2\lim_{\delta\to 0}\delta^{-2}D(P_{U_{i}|YU^{i-1}}^{(\delta)}\|P_{U_{i}|YU^{i-1}}^{(0)}|P_{U^{i-1}Y}^{(0)})$ (103)
$=2\lim_{\delta\to 0}\delta^{-2}D(P_{U_{i}|YU^{i-1}}^{(\delta)}\|P_{U_{i}|U^{i-1}}^{(0)}|P_{U^{i-1}Y}^{(0)})$ (104)
$=2\lim_{\delta\to 0}\delta^{-2}D(P_{U_{i}|YU^{i-1}}^{(\delta)}\|P_{U_{i}|U^{i-1}}^{(0)}|P_{U^{i-1}Y}^{(\delta)})$ (105)
$\geq 2\lim_{\delta\to 0}\delta^{-2}D(P_{U_{i}|YU^{i-1}}^{(\delta)}\|P_{U_{i}|U^{i-1}}^{(\delta)}|P_{U^{i-1}Y}^{(\delta)})$ (106)
$=2\lim_{\delta\to 0}\delta^{-2}I(U_{i};Y|U^{i-1}),$ (107)

where we defined the conditional KL divergence

D(PY|XQY|X|PX):=D(PY|X=xQY|X=x)𝑑PX(x);D(P_{Y|X}\|Q_{Y|X}|P_{X}):=\int D(P_{Y|X=x}\|Q_{Y|X=x})dP_{X}(x);

(104) follows since $P^{(0)}_{U_{i}|YU^{i-1}}=P^{(0)}_{U_{i}|U^{i-1}}$; (105) follows since $\lim_{\delta\to 0}P^{(\delta)}_{U^{i-1}Y}=P^{(0)}_{U^{i-1}Y}$; (106) follows since the conditional divergence $D(P_{U_{i}|YU^{i-1}}^{(\delta)}\|Q_{U_{i}|U^{i-1}}|P_{U^{i-1}Y}^{(\delta)})$ is minimized over $Q_{U_{i}|U^{i-1}}$ by the induced conditional marginal $P^{(\delta)}_{U_{i}|U^{i-1}}$; and (107) holds because the divergence in (106) is exactly the conditional mutual information $I(U_{i};Y|U^{i-1})$ under $P^{(\delta)}$. Summing over odd $i$ and multiplying by $n$ yields (101). ∎

VI-C Variance of ΓB\Gamma^{\rm B}

For any δ\delta, since (𝐔r,𝐗,𝐘)(PUrXY(δ))n({\bf U}^{r},{\bf X,Y})\sim(P^{(\delta)}_{U^{r}XY})^{\otimes n}, we have

var(δ)(ΓB)\displaystyle\operatorname{var}^{(\delta)}(\Gamma^{\rm B}) =l=1nvar(δ)(1iroddΓi(U1(l),,Ui(l),Y(l)))\displaystyle=\sum_{l=1}^{n}\operatorname{var}^{(\delta)}\left(\sum_{1\leq i\leq r}^{\rm odd}\Gamma_{i}(U_{1}(l),\dots,U_{i}(l),Y(l))\right) (108)
=nvar(δ)(1iroddΓi(U1(1),,Ui(1),Y(1))).\displaystyle=n\operatorname{var}^{(\delta)}\left(\sum_{1\leq i\leq r}^{\rm odd}\Gamma_{i}(U_{1}(1),\dots,U_{i}(1),Y(1))\right). (109)

However,

var(δ)(1iroddΓi(U1(1),,Ui(1),Y(1)))\displaystyle\quad\operatorname{var}^{(\delta)}\left(\sum_{1\leq i\leq r}^{\rm odd}\Gamma_{i}(U_{1}(1),\dots,U_{i}(1),Y(1))\right)
𝔼(δ)[(1iroddΓi(U1(1),,Ui(1),Y(1)))2]\displaystyle\leq\mathbb{E}^{(\delta)}\left[\left(\sum_{1\leq i\leq r}^{\rm odd}\Gamma_{i}(U_{1}(1),\dots,U_{i}(1),Y(1))\right)^{2}\right] (110)
(1+δ)𝔼(0)[(1iroddΓi(U1(1),,Ui(1),Y(1)))2]\displaystyle\leq(1+\delta)\mathbb{E}^{(0)}\left[\left(\sum_{1\leq i\leq r}^{\rm odd}\Gamma_{i}(U_{1}(1),\dots,U_{i}(1),Y(1))\right)^{2}\right] (111)
(1+δ)1irodd𝔼(0)[Γi2(U1(1),,Ui(1),Y(1))]\displaystyle\leq(1+\delta)\sum_{1\leq i\leq r}^{\rm odd}\mathbb{E}^{(0)}\left[\Gamma_{i}^{2}(U_{1}(1),\dots,U_{i}(1),Y(1))\right] (112)

where (111) follows since P(δ)P^{(\delta)} is dominated by (1+δ)P(0)(1+\delta)P^{(0)}; (112) used (94). Therefore

var(δ)(ΓB)\displaystyle\quad\operatorname{var}^{(\delta)}(\Gamma^{\rm B})
n(1+δ)1irodd𝔼(0)[Γi2(U1(1),,Ui(1),Y(1))]\displaystyle\leq n(1+\delta)\sum_{1\leq i\leq r}^{\rm odd}\mathbb{E}^{(0)}\left[\Gamma_{i}^{2}(U_{1}(1),\dots,U_{i}(1),Y(1))\right] (113)
=(1+δ)IB.\displaystyle=(1+\delta)I^{\rm B}. (114)

VI-D r=1r=1 Case

We now prove achievability bounds for the Bernoulli distribution estimation algorithm.

Corollary 7.

Given $m_{1},m_{2}>10$, for $r=1$ and $\alpha_{1}:=\frac{m_{1}}{10}$, the mean square error satisfies $\mathbb{E}[|\hat{\delta}^{\rm B}-\delta|^{2}]\leq\frac{25(1+\delta)m_{1}m_{2}}{n}$ and the total communication cost satisfies $\mathbb{E}[|\Pi_{1}|]\leq\frac{2.2(1+\delta)n}{m_{1}}\log_{2}\frac{m_{1}}{10}+1$.

Proof.

We have

𝔼[|δ^Bδ|2]\displaystyle\mathbb{E}[|\hat{\delta}^{\rm B}-\delta|^{2}] =var(δ)(ΓB)(δ𝔼(δ)[ΓB])2\displaystyle=\operatorname{var}^{(\delta)}(\Gamma^{\rm B})\cdot(\partial_{\delta}\mathbb{E}^{(\delta)}[\Gamma^{\rm B}])^{-2} (115)
(1+δ)IB(IB)2\displaystyle\leq(1+\delta)I^{\rm B}\cdot(I^{\rm B})^{-2} (116)
(1+δ)[2nlimδ0δ2I(U1;Y)]1\displaystyle\leq(1+\delta)\left[2n\lim_{\delta\to 0}\delta^{-2}I(U_{1};Y)\right]^{-1} (117)
25(1+δ)m1m2n\displaystyle\leq\frac{25(1+\delta)m_{1}m_{2}}{n} (118)

where (115) follows since δ^B\hat{\delta}^{\rm B} is unbiased (Theorem 3); (116) follows from (99) and (114); (117) follows from Lemma 6; lastly we used Corollary 4.

As for the communication cost

𝔼[|ΠAB|]\displaystyle\mathbb{E}[|\Pi^{\rm A\to B}|] 2(1+δ)n1i1PXUi1(0)(0,0)log2αi+1\displaystyle\leq 2(1+\delta)n\sum_{1\leq i\leq 1}P_{XU_{i-1}}^{(0)}(0,0)\log_{2}\alpha_{i}+1 (119)
2.2(1+δ)nm1log2m110+1\displaystyle\leq\frac{2.2(1+\delta)n}{m_{1}}\log_{2}\frac{m_{1}}{10}+1 (120)

where we used (89) and Corollary 4. ∎
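As an aside (a back-of-the-envelope computation that is not needed later but previews Section VII-B), suppose $m_{1}\asymp m_{2}$ and ignore constant factors and additive lower-order terms. To respect a communication budget of $k$ bits one takes $n\asymp\frac{km_{1}}{\log_{2}m_{1}}$ (cf. (150)), and Corollary 7 then gives

\mathbb{E}[|\hat{\delta}^{\rm B}-\delta|^{2}]\lesssim\frac{m_{1}m_{2}}{n}\asymp\frac{m_{2}\log_{2}m_{1}}{k},

which is the source of the logarithmic factor in the one-way rate of Theorem 1.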

VI-E r=r=\infty Case

Corollary 8.

Let $m_{1},m_{2}>10$ and $m:=\min\{m_{1},m_{2}\}$. For $r$ and $(\alpha_{i})$ defined in Section V-C, the mean square error satisfies $\mathbb{E}[|\hat{\delta}^{\rm B}-\delta|^{2}]\leq\frac{25(1+\delta)m_{1}m_{2}^{2}}{nm}$ and the total communication cost satisfies $\mathbb{E}[|\Pi^{r}|]\leq 6(1+\delta)n(m_{1}^{-1}+m_{2}^{-1})\log_{2}e+\frac{r+1}{2}$.

Proof.

The bound on the mean square error is similar to the r=1r=1 case:

𝔼[|δ^Bδ|2]\displaystyle\mathbb{E}[|\hat{\delta}^{\rm B}-\delta|^{2}] var(δ)(ΓB)(δ𝔼(δ)[ΓB])2\displaystyle\leq\operatorname{var}^{(\delta)}(\Gamma^{\rm B})\cdot(\partial_{\delta}\mathbb{E}^{(\delta)}[\Gamma^{\rm B}])^{-2} (121)
(1+δ)IB(IB)2\displaystyle\leq(1+\delta)I^{\rm B}\cdot(I^{\rm B})^{-2} (122)
(1+δ)[2nlimδ0δ21iroddI(Ui;Y|Ui1)]1\displaystyle\leq(1+\delta)\left[2n\lim_{\delta\to 0}\delta^{-2}\sum_{1\leq i\leq r}^{\rm odd}I(U_{i};Y|U^{i-1})\right]^{-1} (123)
25(1+δ)m12m2mn\displaystyle\leq\frac{25(1+\delta)m_{1}^{2}m_{2}}{mn} (124)

except that we use Corollary 5 in the last step.

For the communication cost,

𝔼[|ΠAB|]\displaystyle\mathbb{E}[|\Pi^{\rm A\to B}|] 2(1+δ)n1iroddPXUi1(0)(0,0)log2αi+r+12\displaystyle\leq 2(1+\delta)n\sum_{1\leq i\leq r}^{\rm odd}P_{XU_{i-1}}^{(0)}(0,0)\log_{2}\alpha_{i}+\frac{r+1}{2} (125)
6(1+δ)n(m11+m21)log2e+r+12\displaystyle\leq 6(1+\delta)n(m_{1}^{-1}+m_{2}^{-1})\log_{2}e+\frac{r+1}{2} (126)

where we used (89) and Corollary 5. ∎
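By the same back-of-the-envelope computation as after Corollary 7 (again only a preview, with $m_{1}\asymp m_{2}\asymp m$ and constant factors ignored), a budget of $k$ bits allows $n\asymp km$ (cf. (155)), and Corollary 8 then gives

\mathbb{E}[|\hat{\delta}^{\rm B}-\delta|^{2}]\lesssim\frac{m_{1}m_{2}^{2}}{nm}\asymp\frac{m^{2}}{n}\asymp\frac{m}{k};

the logarithmic factor of the one-way case disappears, which is ultimately what separates the rates in Theorems 1 and 2.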

VII Density Estimation Upper Bounds

In this section we prove the upper bounds in Theorem 1 and Theorem 2 by building nonparametric density estimators on top of the Bernoulli distribution estimator. For $\beta\in(0,2]$, the rectangular kernel is minimax optimal (Section II-B), so the integral of the density against the kernel can be estimated directly with the Bernoulli distribution estimator; this is explained in Sections VII-B and VII-C. The extension to higher-order kernels, which are constructed as linear combinations of rectangular kernels, is explained in Section VII-D.

VII-A Density Lower Bound Assumption

First, we observe the following simple argument showing that it suffices to consider A>0A>0. Define

B:=supx,y[0,1]dsuppXYpXY(x,y),\displaystyle B:=\sup_{x,y\in[0,1]^{d}}\sup_{p_{XY}}p_{XY}(x,y), (127)

where the inner supremum is over all densities $p_{XY}$ on $\mathbb{R}^{2d}$ satisfying $\|p_{XY}\|_{[0,1]^{2d},\beta}\leq L$. Clearly $B$ is finite, $B>1$, and $B$ depends only on $\beta,L,d$.

Lemma 9.

In either the one-way or the interactive setting, suppose that there exists an algorithm achieving maxpXY(β,L,A)𝔼[|p^pXY(x0,y0)|2]R\max_{p_{XY}\in\mathcal{H}(\beta,L,A)}\mathbb{E}[|\hat{p}-p_{XY}(x_{0},y_{0})|^{2}]\leq R for some R>0R>0 and A(0,1)A\in(0,1). Then, there must be an algorithm achieving maxpXY(β,L,0)𝔼[|p^pXY(x0,y0)|2](1+A1A)2R\max_{p_{XY}\in\mathcal{H}(\beta,L,0)}\mathbb{E}[|\hat{p}-p_{XY}(x_{0},y_{0})|^{2}]\leq\left(\frac{1+A}{1-A}\right)^{2}R.

Proof.

Pick one $p_{XY}$ such that $\|p_{XY}\|_{[0,1]^{2d},\beta}\leq L$ and $\inf_{x,y\in[0,1]^{d}}p_{XY}(x,y)\geq\frac{1+A}{2}>A$; since $\frac{1+A}{2}<1$, such a $p_{XY}$ exists. Consider an arbitrary $q_{XY}\in\mathcal{H}(\beta,L,0)$, and suppose that an infinite sequence of pairs $(X_{1},Y_{1}),\dots$ i.i.d. according to $q_{XY}$ is available to Alice and Bob. Using the common randomness, Alice and Bob can simulate i.i.d. pairs $(\tilde{X}_{1},\tilde{Y}_{1}),\dots$ according to $\tilde{p}_{XY}:=\frac{2A}{1+A}p_{XY}+\frac{1-A}{1+A}q_{XY}$, by replacing each pair, with probability $\frac{2A}{1+A}$, by a fresh pair drawn according to $p_{XY}$. Clearly $\tilde{p}_{XY}\in\mathcal{H}(\beta,L,A)$, and by assumption, $\tilde{p}_{XY}(x_{0},y_{0})$ can be estimated with mean square risk $R$. Since $p_{XY}$ is known, Bob can output $\frac{1+A}{1-A}\big(\hat{p}-\frac{2A}{1+A}p_{XY}(x_{0},y_{0})\big)$, whose error is $\frac{1+A}{1-A}$ times the error of $\hat{p}$; hence $q_{XY}(x_{0},y_{0})$ can be estimated with mean square risk $\left(\frac{1+A}{1-A}\right)^{2}R$. ∎

For the rest of the section, we will assume that there is a density lower bound A>0A>0 and pXY(β,L,A)p_{XY}\in\mathcal{H}(\beta,L,A), which is sufficient in view of Lemma 9. Consider bandwidth h>0h>0 (which will be specified later as an inverse polynomial of kk). Also introduce the notations

𝒜\displaystyle\mathcal{A} :=x0+h[1,1]d;\displaystyle:=x_{0}+h[-1,1]^{d}; (128)
\displaystyle\mathcal{B} :=y0+h[1,1]d.\displaystyle:=y_{0}+h[-1,1]^{d}. (129)

Define PX¯Y¯P_{\underline{X}\underline{Y}} as the probability distribution induced by PXYP_{XY} and with

X¯\displaystyle\underline{X} :=1{X𝒜};\displaystyle:=1\{X\notin\mathcal{A}\}; (130)
Y¯\displaystyle\underline{Y} :=1{Y}.\displaystyle:=1\{Y\notin\mathcal{B}\}. (131)

Define

m1\displaystyle m_{1} :=1(X¯=0);\displaystyle:=\mathbb{P}^{-1}(\underline{X}=0); (132)
m2\displaystyle m_{2} :=1(Y¯=0).\displaystyle:=\mathbb{P}^{-1}(\underline{Y}=0). (133)

Note that $m_{1}/m_{2}$ is bounded above and below by positive constants depending on $A$, $\beta$, and $L$ (see (138) and (143)). Also, we can assume that Alice and Bob both know $m_{1}$ and $m_{2}$: with infinite samples Alice and Bob know their marginal densities $p_{X}$ and $p_{Y}$, and Alice can send $m_{1}$ to Bob to very high precision using a negligible number of bits. Let $\delta\geq-1$ be the number such that $P_{\underline{X}\underline{Y}}$ is the matrix

\displaystyle\left(\begin{array}[]{cc}\frac{1}{m_{1}m_{2}}(1+\delta)&\frac{1}{m_{1}}(1-\frac{1}{m_{2}})-\frac{\delta}{m_{1}m_{2}}\\ \frac{1}{m_{2}}(1-\frac{1}{m_{1}})-\frac{\delta}{m_{1}m_{2}}&(1-\frac{1}{m_{1}})(1-\frac{1}{m_{2}})+\frac{\delta}{m_{1}m_{2}}\end{array}\right). (136)

Let δ^B\hat{\delta}^{\rm B} be Bob’s estimator of δ\delta in (52). Then we define Bob’s density estimator:

p^B:=1+δ^Bm1m2h2d.\displaystyle\hat{p}^{\rm B}:=\frac{1+\hat{\delta}^{\rm B}}{m_{1}m_{2}h^{2d}}. (137)
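The following minimal, centralized sketch illustrates the plug-in form of (137); it is for intuition only. The paper's $\hat{\delta}^{\rm B}$ is Bob's interactive estimator from (52), whereas the sketch substitutes the naive empirical plug-in; the generating density, the point $(x_{0},y_{0})$, and the cell size (cells of volume $h^{d}$ around $x_{0}$ and $y_{0}$) are illustrative assumptions.

```python
import numpy as np

# Centralized illustration of the plug-in form of (137); for intuition only.
# delta-hat^B in the paper is Bob's interactive estimator from (52); here we use
# the naive empirical plug-in instead, and take the cells around x0 and y0 to
# have volume h^d each (constant factors in the cell volume only affect constants,
# not the rates). The density below is a hypothetical smooth choice on [0,1]^2.
rng = np.random.default_rng(1)
d, h, n = 1, 0.05, 200_000
x0, y0 = 0.4, 0.6
p = lambda x, y: 1 + 0.5 * np.sin(2*np.pi*x) * np.sin(2*np.pi*y)

cand = rng.random((3 * n, 3))                      # rejection sampling from p (max = 1.5)
keep = cand[:, 2] < p(cand[:, 0], cand[:, 1]) / 1.5
X, Y = cand[keep, 0][:n], cand[keep, 1][:n]

in_A = np.abs(X - x0) <= h / 2                     # Xbar = 0  <=>  X in the cell around x0
in_B = np.abs(Y - y0) <= h / 2                     # Ybar = 0  <=>  Y in the cell around y0
m1, m2 = 1 / in_A.mean(), 1 / in_B.mean()          # empirical analogues of (132)-(133)
delta_hat = m1 * m2 * (in_A & in_B).mean() - 1     # plug-in for delta in (136)
p_hat = (1 + delta_hat) / (m1 * m2 * h**(2 * d))   # density estimate, formula (137)
print(p_hat, p(x0, y0))                            # close to the true value for large n
```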

We next show that the smoothness of the density ensures that $1+\delta$ is at most of constant order. Recall that $A$ lower bounds $p_{XY}$ on $[0,1]^{2d}$, hence also the marginals $p_{X}$ and $p_{Y}$ on $[0,1]^{d}$. Define $M:=\max\{m_{1},m_{2}\}$ and $m:=\min\{m_{1},m_{2}\}$. The definition of $(m_{1},m_{2})$ implies $Ah^{d}\leq\frac{1}{M}$, and hence

h(1AM)1/d.\displaystyle h\leq\left(\frac{1}{AM}\right)^{1/d}. (138)

Recall BB defined in (127). We then have

1+δ\displaystyle 1+\delta =m1m2PXY(𝒜×)\displaystyle=m_{1}m_{2}P_{XY}(\mathcal{A}\times\mathcal{B}) (139)
m1m2Bh2d\displaystyle\leq m_{1}m_{2}Bh^{2d} (140)
Bm1m2A2m2\displaystyle\leq\frac{Bm_{1}m_{2}}{A^{2}m^{2}} (141)
=BMA2m.\displaystyle=\frac{BM}{A^{2}m}. (142)

Next, observe that $1=m_{1}P_{X}(\mathcal{A})\leq m_{1}Bh^{d}$, which yields $h^{d}\geq\frac{1}{m_{1}B}$. Similarly we also have $h^{d}\geq\frac{1}{m_{2}B}$, therefore

hd1mB\displaystyle h^{d}\geq\frac{1}{mB} (143)

Together with (138), we see that hd=Θ(1/m)=Θ(1/M)h^{d}=\Theta(1/m)=\Theta(1/M).

Next, the bias of the density estimator is

𝔼[p^B]pXY(x0,y0)=PXY(𝒜×)h2dpXY(x0,y0)\displaystyle\mathbb{E}[\hat{p}^{\rm B}]-p_{XY}(x_{0},y_{0})=\frac{P_{XY}(\mathcal{A}\times\mathcal{B})}{h^{2d}}-p_{XY}(x_{0},y_{0}) (144)

which is just the bias of the rectangular kernel estimator (with bandwidth $h$ in each of the two $d$-dimensional subspaces). The rectangular kernel is a compactly supported kernel of order 1 [44, Definition 1.3], while by assumption $\beta\in(0,2]$; therefore the bias is bounded as ([44, Proposition 1.2])

|𝔼[p^B]pXY(x0,y0)|Chβ\displaystyle|\mathbb{E}[\hat{p}^{\rm B}]-p_{XY}(x_{0},y_{0})|\leq Ch^{\beta} (145)

where CC is a constant depending only on β,d\beta,d and LL.

VII-B One-Way Case

By Corollary 7 and (143), we can bound the variance of the density estimator as

var(p^B)\displaystyle\operatorname{var}(\hat{p}^{\rm B}) =1m12m22h4dvar(δ^B)\displaystyle=\frac{1}{m_{1}^{2}m_{2}^{2}h^{4d}}\operatorname{var}(\hat{\delta}^{\rm B}) (146)
1m12m22h4d25(1+δ)m1m2n\displaystyle\leq\frac{1}{m_{1}^{2}m_{2}^{2}h^{4d}}\cdot\frac{25(1+\delta)m_{1}m_{2}}{n} (147)
25(1+δ)B4m3nM\displaystyle\leq\frac{25(1+\delta)B^{4}m^{3}}{nM} (148)

where (148) used (143). Also by Corollary 7, the communication constraint is satisfied if the following holds

2.2(1+δ)nm1log2m110+1k.\displaystyle\frac{2.2(1+\delta)n}{m_{1}}\log_{2}\frac{m_{1}}{10}+1\leq k. (149)

Now we can choose $h$ so that $m_{1}$, as defined by (132), equals $(\frac{k}{\log_{2}k})^{\frac{d}{d+2\beta}}$, and set

n\displaystyle n =(2.2(1+δmax)m1log2m110)1(k1)\displaystyle=\left\lfloor\left(\frac{2.2(1+\delta_{\rm max})}{m_{1}}\log_{2}\frac{m_{1}}{10}\right)^{-1}(k-1)\right\rfloor (150)

where $\delta_{\rm max}$ is an upper bound on the true parameter $\delta$, obtained from (142) together with the fact that $M/m$ is bounded by a constant (see the remark below (133)); hence $\delta_{\rm max}$ depends only on $(A,\beta,L)$. Then the communication constraint is satisfied. Moreover, by the bias bound (145) and the variance bound (148), the risk is bounded by

|𝔼[p^B]pXY(x0,y0)|2+var(p^B)\displaystyle\quad|\mathbb{E}[\hat{p}^{\rm B}]-p_{XY}(x_{0},y_{0})|^{2}+\operatorname{var}(\hat{p}^{\rm B})
C2h2β+25(1+δ)B4m3nM\displaystyle\leq C^{2}h^{2\beta}+\frac{25(1+\delta)B^{4}m^{3}}{nM} (151)
\displaystyle\leq C^{2}(Am)^{-2\beta/d}+\frac{25(1+\delta)B^{4}m^{3}}{nM} (152)
D(klogk)2βd+2β\displaystyle\leq D(\frac{k}{\log k})^{-\frac{2\beta}{d+2\beta}} (153)

where $D$ is a constant depending only on $\beta$, $L$, and $A$; here we used that $\delta$ is bounded above via (142), the bounds (138) and (143) on $h$ (so that $m\asymp M\asymp m_{1}$), and the choices of $m_{1}$ and $n$. This proves the upper bound in Theorem 1 for $\beta\in(0,2]$.
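For concreteness (a purely illustrative instantiation of the above choices), take $d=1$ and $\beta=1$: then $m_{1}=(k/\log_{2}k)^{1/3}$, $h\asymp m_{1}^{-1}$, $n\asymp km_{1}/\log_{2}m_{1}$, and (153) becomes

\mathbb{E}\big[|\hat{p}^{\rm B}-p_{XY}(x_{0},y_{0})|^{2}\big]\leq D\Big(\frac{k}{\log k}\Big)^{-2/3},

to be compared with the order $k^{-2/3}$ achieved by the interactive scheme of the next subsection.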

VII-C Interactive Case

Choose $h$ such that $m:=\min\{m_{1},m_{2}\}$, with $m_{1},m_{2}$ defined by (132)-(133), satisfies

\displaystyle m=k^{\frac{d}{d+2\beta}}, (154)

and set

n:=mkln213(1+δmax)\displaystyle n:=\left\lfloor\frac{mk\ln 2}{13(1+\delta_{\rm max})}\right\rfloor (155)

where, as before, $\delta_{\rm max}$ is an upper bound on $\delta$ that depends only on $(A,\beta,L)$. For $k$ large enough, the choice (154) gives $m>10$, and by Corollary 8 and the choice (155) of $n$, the communication cost is bounded by $k$. Moreover, arguing as in (151)-(153) but with Corollary 8 in place of Corollary 7, the risk is bounded by $Dk^{-\frac{2\beta}{d+2\beta}}$ for some $D$ depending only on $\beta$, $L$, and $A$. This proves the upper bound in Theorem 2 when $\beta\in(0,2]$.

VII-D Extension to β>2\beta>2

For β>2\beta>2, the rectangular kernel is no longer minimax optimal. However, observe the following:

Proposition 10.

For any positive integers $d$ and $l$, there exists an order-$l$ kernel in $\mathbb{R}^{d}$ that is a linear combination of $(\lfloor l/2\rfloor+1)^{d}$ indicator functions.

Proof.

In the following we prove the case $d=1$; the case of general $d$ then follows by taking tensor products of kernel functions on $\mathbb{R}$. Note that an order-$l$ kernel must satisfy the following equations:

K(u)𝑑u\displaystyle\int K(u)du =1;\displaystyle=1; (156)
ujK(u)𝑑u\displaystyle\int u^{j}K(u)du =0,j=1,,l.\displaystyle=0,\quad j=1,\dots,l. (157)

Let us consider $K$ of the following form:

K(u)=k=1k0ck1[k,k]\displaystyle K(u)=\sum_{k=1}^{k_{0}}c_{k}1_{[-k,k]} (158)

where $k_{0}:=\lfloor l/2\rfloor+1$. Since such a $K(u)$ is an even function, the conditions in (157) with odd $j$ hold automatically, and (156)-(157) yield $k_{0}$ nontrivial equations for $c_{1},\dots,c_{k_{0}}$ (namely (156) and the conditions in (157) with even $j$):

2k=1k0kck=1;\displaystyle 2\sum_{k=1}^{k_{0}}kc_{k}=1; (159)
k=1k02kj+1j+1ck\displaystyle\sum_{k=1}^{k_{0}}\frac{2k^{j+1}}{j+1}c_{k} =0,j{1,,l}2.\displaystyle=0,\quad j\in\{1,\dots,l\}\cap 2\mathbb{Z}. (160)

After factoring $2k$ out of the column indexed by $k$ and $\frac{1}{j+1}$ out of each row, the coefficient matrix of (159)-(160) becomes a Vandermonde matrix in the distinct nodes $1^{2},2^{2},\dots,k_{0}^{2}$; hence, by the formula for the Vandermonde determinant, these equations have a unique solution for $c_{1},\dots,c_{k_{0}}$. ∎
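The construction in Proposition 10 is easy to carry out explicitly; the following sketch (with an arbitrary illustrative choice of the order $l$) solves (159)-(160) numerically and checks the moment conditions of the resulting kernel.

```python
import numpy as np

# Construction of Proposition 10 for d = 1: solve (159)-(160) for the coefficients
# c_1, ..., c_{k0} of K(u) = sum_k c_k 1_{[-k,k]}(u), then verify numerically that
# K integrates to 1 and that its moments of order 1, ..., l vanish.
l = 5                                     # kernel order; an illustrative choice
k0 = l // 2 + 1
ks = np.arange(1, k0 + 1)
rows, rhs = [2.0 * ks], [1.0]             # normalization (159): 2 * sum_k k*c_k = 1
for j in range(2, l + 1, 2):              # even j in {1, ..., l}, cf. (160)
    rows.append(2.0 * ks**(j + 1) / (j + 1))
    rhs.append(0.0)
c = np.linalg.solve(np.array(rows), np.array(rhs))

u = np.linspace(-k0, k0, 2_000_001)
du = u[1] - u[0]
K = sum(ck * (np.abs(u) <= k) for ck, k in zip(c, ks))
for j in range(l + 1):                    # moment j = 0 gives 1; j = 1, ..., l give ~0
    print(j, (u**j * K).sum() * du)
```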

Now for general $\beta>0$, we can take an order-$l$ kernel with $l=\lfloor\beta\rfloor$ as in Proposition 10. We can estimate $\frac{1}{h^{2d}}\int p_{XY}(x,y)K(\frac{(x,y)-(x_{0},y_{0})}{h})\,dx\,dy$ by applying the Bernoulli distribution estimator $(\lfloor l/2\rfloor+1)^{2d}$ times, once for each indicator in the linear combination. Therefore, by arguments similar to those in the preceding subsections, the upper bounds in Theorem 1 and Theorem 2 hold for $\beta>2$ as well.

VIII One-way Density Estimation Lower Bound

VIII-A Upper Bounding s(X;Y)s^{*}(X;Y)

The pointwise estimation lower bound is obtained by lower bounding the risk for estimating PX¯Y¯P_{\underline{X}\underline{Y}} (with X¯\underline{X} and Y¯\underline{Y} being indicators of neighborhoods of x0x_{0} and y0y_{0}), and applying Le Cam’s inequality to the latter. Therefore we are led to considering the strong data processing constant for the biased Bernoulli distribution.

Theorem 5.

Let $P_{XY}^{(\delta)}$ be as defined in (31), where $\delta\in(-1,1)$ and $m>1$. Then $s^{*}(X;Y)\leq\frac{\delta^{2}}{m\ln m-m+1}$.

Proof.

For this proof we can assume without loss of generality that the logarithms are natural. For any QXQ_{X}, let QYQ_{Y} be the output through the channel PY|XP_{Y|X}. Then

s(X;Y)\displaystyle s^{*}(X;Y) D(QYPY)D(QXPX)\displaystyle\leq\frac{D(Q_{Y}\|P_{Y})}{D(Q_{X}\|P_{X})} (161)
Dχ2(QYPY)Dχ2(QXPX)Dχ2(QXPX)D(QXPX)\displaystyle\leq\frac{D_{\chi^{2}}(Q_{Y}\|P_{Y})}{D_{\chi^{2}}(Q_{X}\|P_{X})}\cdot\frac{D_{\chi^{2}}(Q_{X}\|P_{X})}{D(Q_{X}\|P_{X})} (162)
δ2mlnmm+1\displaystyle\leq\frac{\delta^{2}}{m\ln m-m+1} (163)

where we defined the $\chi^{2}$ divergence in (7). The justification of the steps is as follows: (161) is well known; (162) follows since the $\chi^{2}$ divergence dominates the KL divergence (see e.g. [39]). To see (163), note that $\frac{D_{\chi^{2}}(Q_{Y}\|P_{Y})}{D_{\chi^{2}}(Q_{X}\|P_{X})}$ is upper bounded by $\rho_{\sf m}^{2}(X,Y)$, the square of the maximal correlation coefficient (see e.g. [5, 6]). Viewing $\rho_{\sf m}(X,Y)$ as the norm of a linear operator, it can be shown to equal the second largest singular value of

𝐌\displaystyle{\bf M} :=(1PX(x)PXY(x,y)1PY(y))x,y\displaystyle:=\left(\frac{1}{\sqrt{P_{X}(x)}}P_{XY}(x,y)\frac{1}{\sqrt{P_{Y}(y)}}\right)_{x,y} (164)
=[1+δm11m+δm(m1)];\displaystyle=\begin{bmatrix}\frac{1+\delta}{m}&*\\ *&1-\frac{1}{m}+\frac{\delta}{m(m-1)}\end{bmatrix}; (165)

see e.g. [6]. Since ${\bf M}$ is a symmetric matrix, its singular values are the absolute values of its eigenvalues. The largest eigenvalue of ${\bf M}$ is 1, corresponding to the eigenvector $(\sqrt{P_{X}(0)},\sqrt{P_{X}(1)})$ (which is evident from (164)), whereas the trace

tr(𝐌)=1+δm+11m+δm(m1)=1+δm1\displaystyle\operatorname{tr}({\bf M})=\frac{1+\delta}{m}+1-\frac{1}{m}+\frac{\delta}{m(m-1)}=1+\frac{\delta}{m-1} (166)

which is evident from (165). Therefore the second eigenvalue of ${\bf M}$ is $\frac{\delta}{m-1}$, and $\rho_{\sf m}^{2}(X;Y)=\frac{\delta^{2}}{(m-1)^{2}}$. Moreover, since the $\chi^{2}$ and KL divergences are both $f$-divergences, their ratio can be bounded by the ratio of their corresponding $f$-functions (see e.g. [39]):

Dχ2(QXPX)D(QXPX)\displaystyle\frac{D_{\chi^{2}}(Q_{X}\|P_{X})}{D(Q_{X}\|P_{X})} sup0<tm(t1)2tlntt+1\displaystyle\leq\sup_{0<t\leq m}\frac{(t-1)^{2}}{t\ln t-t+1} (167)
=(m1)2mlnmm+1;\displaystyle=\frac{(m-1)^{2}}{m\ln m-m+1}; (168)

The constraint tmt\leq m in (167) is because minxPX(x)=1m\min_{x}P_{X}(x)=\frac{1}{m} and maxxQX(x)PX(x)m\max_{x}\frac{Q_{X}(x)}{P_{X}(x)}\leq m. To show (168), it suffices to show that

infu(1,m1](1+u)ln(1+u)uu2\inf_{u\in(-1,m-1]}\frac{(1+u)\ln(1+u)-u}{u^{2}}

is achieved at u=m1u=m-1. For this, it suffices to show that the derivative of the objective function, (2+u)ln(1+u)+2uu3\frac{-(2+u)\ln(1+u)+2u}{u^{3}} is negative on (1,m1](-1,m-1]. Indeed, define ϕ(u):=(2+u)ln(1+u)2u\phi(u):=(2+u)\ln(1+u)-2u. We can check that ϕ(0)=0\phi(0)=0, ϕ(0)=0\phi^{\prime}(0)=0, and ϕ′′(u)=u(1+u)2\phi^{\prime\prime}(u)=\frac{u}{(1+u)^{2}}, which imply that ϕ(u)>0\phi(u)>0 for u>0u>0 and ϕ(u)<0\phi(u)<0 for u<0u<0. Therefore (168), and hence (163), holds. ∎
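The linear-algebra step in the proof is easy to verify numerically; the following sketch (with arbitrary illustrative values of $m$ and $\delta$, and with $P_{XY}^{(\delta)}$ parametrized as in (171) with $m_{1}=m_{2}=m$) checks that the second largest singular value of ${\bf M}$ is indeed $|\delta|/(m-1)$.

```python
import numpy as np

# Check of the maximal-correlation computation in the proof of Theorem 5:
# the second largest singular value of M in (164) equals |delta| / (m - 1).
m, delta = 7.0, 0.3                      # arbitrary test values with m > 1, |delta| < 1
p = 1.0 / m
P = np.array([[p*p*(1 + delta),          p*(1 - p) - p*p*delta],
              [p*(1 - p) - p*p*delta,    (1 - p)**2 + p*p*delta]])   # cf. (171)
Dx = np.diag(1 / np.sqrt(P.sum(axis=1)))  # diag(1/sqrt(P_X))
Dy = np.diag(1 / np.sqrt(P.sum(axis=0)))  # diag(1/sqrt(P_Y))
M = Dx @ P @ Dy                           # the matrix in (164)
s = np.linalg.svd(M, compute_uv=False)    # singular values, largest first
print(s[0], s[1], abs(delta) / (m - 1))   # s[0] ~ 1 and s[1] = |delta|/(m-1)
```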

VIII-B Lower Bounding One-Way NP Estimation Risk

Given k,d,β,L,Ak,d,\beta,L,A, consider a distribution PX¯Y¯P_{\underline{X}\underline{Y}} on {0,1}2\{0,1\}^{2} with matrix

(1m2(1+δ)1m(11m)δm21m(11m)δm2(11m)2+δm2),\displaystyle\left(\begin{array}[]{cc}\frac{1}{m^{2}}(1+\delta)&\frac{1}{m}(1-\frac{1}{m})-\frac{\delta}{m^{2}}\\ \frac{1}{m}(1-\frac{1}{m})-\frac{\delta}{m^{2}}&(1-\frac{1}{m})^{2}+\frac{\delta}{m^{2}}\end{array}\right), (171)

where m:=(aklnk)d2β+dm:=\left(\frac{ak}{\ln k}\right)^{\frac{d}{2\beta+d}} and δ:=mβd\delta:=m^{-\frac{\beta}{d}}, with a:=16β+8ddln2a:=\frac{16\beta+8d}{d}\ln 2 being a constant. We then need to “simulate” smooth distributions from PX¯Y¯P_{\underline{X}\underline{Y}}. Let f:d[0,)f\colon\mathbb{R}^{d}\to[0,\infty) be a function satisfying the following properties:

  • ff has a compact support;

  • df=1\int_{\mathbb{R}^{d}}f=1;

  • f(0)>0f(0)>0;

  • f(x)[0,1]f(x)\in[0,1], for all xdx\in\mathbb{R}^{d};

  • fd,β<L4\|f\|_{\mathbb{R}^{d},\beta}<\frac{L}{4};

  • Define g(x,y)=f(x)f(y)g(x,y)=f(x)f(y) as a function on 2d\mathbb{R}^{2d}. Then g2d,β<L4\|g\|_{\mathbb{R}^{2d},\beta}<\frac{L}{4}.

Clearly, such a function exists for any given $\beta,L,d$. For $m$ sufficiently large that $m^{-1/d}\operatorname{supp}(f)+x_{0}\subseteq[0,1]^{d}$ and $m^{-1/d}\operatorname{supp}(f)+y_{0}\subseteq[0,1]^{d}$ (recall that $(x_{0},y_{0})$ is the given point in the density estimation problem), define

pX|X¯=0(x):=1PX¯(0)f(m1d(xx0)),xd;\displaystyle p_{X|\underline{X}=0}(x):=\frac{1}{P_{\underline{X}}(0)}f(m^{\frac{1}{d}}(x-x_{0})),\quad\forall x\in\mathbb{R}^{d}; (172)
pY|Y¯=0(y):=1PY¯(0)f(m1d(yy0)),yd.\displaystyle p_{Y|\underline{Y}=0}(y):=\frac{1}{P_{\underline{Y}}(0)}f(m^{\frac{1}{d}}(y-y_{0})),\quad\forall y\in\mathbb{R}^{d}. (173)

Since PX¯(0)=PY¯(0)=1mP_{\underline{X}}(0)=P_{\underline{Y}}(0)=\frac{1}{m}, clearly the above define valid probability densities supported on [0,1]d[0,1]^{d}. Define

pX|X¯=1(x):=1{x[0,1]d}PX¯(1)[1f(m1/d(xx0))];\displaystyle p_{X|\underline{X}=1}(x):=\frac{1\{x\in[0,1]^{d}\}}{P_{\underline{X}}(1)}[1-f(m^{1/d}(x-x_{0}))]; (174)
pY|Y¯=1(y):=1{y[0,1]d}PY¯(1)[1f(m1/d(yy0))],\displaystyle p_{Y|\underline{Y}=1}(y):=\frac{1\{y\in[0,1]^{d}\}}{P_{\underline{Y}}(1)}[1-f(m^{1/d}(y-y_{0}))], (175)

which are also probability densities supported on [0,1]d[0,1]^{d}. Define PXY|XY¯=PX|X¯PY|Y¯P_{XY|\underline{XY}}=P_{X|\underline{X}}P_{Y|\underline{Y}}, where PX|X¯P_{X|\underline{X}} and PY|Y¯P_{Y|\underline{Y}} are conditional distributions defined by the densities above. Under the joint distribution PXYXY¯P_{XY\underline{XY}}, we have

pX(x)=pY(y)=1,x,y[0,1]d.\displaystyle p_{X}(x)=p_{Y}(y)=1,\quad\forall x,y\in[0,1]^{d}. (176)

Define

P¯XYXY¯=PX|X¯PY|Y¯PX¯PY¯.\displaystyle\bar{P}_{XY\underline{XY}}=P_{X|\underline{X}}P_{Y|\underline{Y}}P_{\underline{X}}P_{\underline{Y}}. (177)

We now check that the density of PXYP_{XY} satisfies pXY(0,1)2d,βL\|p_{XY}\|_{(0,1)^{2d},\beta}\leq L for mm sufficiently large. Indeed, for x,y[0,1]dx,y\in[0,1]^{d},

pXY(x,y)\displaystyle\quad p_{XY}(x,y)
=i,j=0,1pXY|XY¯=(i,j)(x,y)PXY¯(i,j)\displaystyle=\sum_{i,j=0,1}p_{XY|\underline{XY}=(i,j)}(x,y)P_{\underline{XY}}(i,j) (178)
=i,j=0,1pXY|XY¯=(i,j)(x,y)P¯XY¯(i,j)\displaystyle=\sum_{i,j=0,1}p_{XY|\underline{XY}=(i,j)}(x,y)\bar{P}_{\underline{XY}}(i,j)
+δm2(pX|X¯=0(x)pX|X¯=1(x))(pY|Y¯=0(y)pY|Y¯=1(y))\displaystyle\quad+\tfrac{\delta}{m^{2}}(p_{X|\underline{X}=0}(x)-p_{X|\underline{X}=1}(x))(p_{Y|\underline{Y}=0}(y)-p_{Y|\underline{Y}=1}(y)) (179)
=1+δm2(pX|X¯=0(x)pX|X¯=1(x))\displaystyle=1+\tfrac{\delta}{m^{2}}(p_{X|\underline{X}=0}(x)-p_{X|\underline{X}=1}(x))
(pY|Y¯=0(y)pY|Y¯=1(y))\displaystyle\quad\cdot(p_{Y|\underline{Y}=0}(y)-p_{Y|\underline{Y}=1}(y)) (180)
=1+δ[1m1+111mf(m1/d(xx0))]\displaystyle=1+\delta\left[-\tfrac{1}{m-1}+\tfrac{1}{1-\tfrac{1}{m}}f(m^{1/d}(x-x_{0}))\right]
[1m1+111mf(m1/d(yy0))]\displaystyle\quad\cdot\left[-\tfrac{1}{m-1}+\tfrac{1}{1-\tfrac{1}{m}}f(m^{1/d}(y-y_{0}))\right] (181)
=const.δm(m1)2f(m1/d(xx0))\displaystyle={\rm const.}-\frac{\delta m}{(m-1)^{2}}f(m^{1/d}(x-x_{0}))
δm(m1)2f(m1/d(yy0))\displaystyle\quad-\frac{\delta m}{(m-1)^{2}}f(m^{1/d}(y-y_{0}))
+δ(11/m)2f(m1/d(xx0))f(m1/d(yy0)).\displaystyle\quad+\frac{\delta}{(1-1/m)^{2}}f(m^{1/d}(x-x_{0}))f(m^{1/d}(y-y_{0})). (182)

By the assumptions on ff, we see that

mβ/df(m1/d(x0))(0,1)d,β\displaystyle\|m^{-\beta/d}f(m^{1/d}(\cdot-x_{0}))\|_{(0,1)^{d},\beta} L4;\displaystyle\leq\frac{L}{4}; (183)
mβ/df(m1/d(y0))(0,1)d,β\displaystyle\|m^{-\beta/d}f(m^{1/d}(\cdot-y_{0}))\|_{(0,1)^{d},\beta} L4;\displaystyle\leq\frac{L}{4}; (184)
mβ/df(m1/d(x0))f(m1/d(y0))(0,1)2d,β\displaystyle\|m^{-\beta/d}f(m^{1/d}(\cdot-x_{0}))f(m^{1/d}(*-y_{0}))\|_{(0,1)^{2d},\beta} L4.\displaystyle\leq\frac{L}{4}. (185)

Therefore with the choice δ=mβ/d\delta=m^{-\beta/d}, we have pXY(0,1)2d,βL\|p_{XY}\|_{(0,1)^{2d},\beta}\leq L for m10m\geq 10.
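The construction above can be checked numerically; the sketch below (for $d=1$, with a concrete admissible bump $f(t)=\cos^{2}(\pi t/2)$ on $[-1,1]$, and arbitrary illustrative values of $m$ and $\beta$) verifies the uniform marginals (176) and the value formula (186). The Hölder-norm conditions on $f$ would in general require rescaling by a constant depending on $L$, which the sketch ignores.

```python
import numpy as np

# Numerical check of the construction (171)-(176) and of formula (186), for d = 1.
# f is one concrete admissible bump (compact support, integral 1, values in [0,1],
# f(0) = 1 > 0); the Hoelder-norm conditions are ignored in this illustration.
m, beta = 20.0, 1.0
delta = m**(-beta)                        # delta = m^{-beta/d} with d = 1
x0 = y0 = 0.5
f = lambda t: np.where(np.abs(t) <= 1, np.cos(np.pi * t / 2)**2, 0.0)

P = np.array([[(1 + delta)/m**2,              (1/m)*(1 - 1/m) - delta/m**2],
              [(1/m)*(1 - 1/m) - delta/m**2,  (1 - 1/m)**2 + delta/m**2]])   # (171)

x = np.linspace(0, 1, 2001)
pX = np.stack([m * f(m*(x - x0)), (1 - f(m*(x - x0))) / (1 - 1/m)])   # (172), (174)
pY = np.stack([m * f(m*(x - y0)), (1 - f(m*(x - y0))) / (1 - 1/m)])   # (173), (175)

# p_X(x) = sum_i p_{X|Xbar=i}(x) * P_{Xbar}(i) should be identically 1, cf. (176):
print(np.abs(pX.T @ P.sum(axis=1) - 1).max())

# p_XY(x0, y0) computed from the mixture (178) versus the closed form (186):
val = np.einsum('i,j,ij->', pX[:, 1000], pY[:, 1000], P)
print(val, 1 + delta * ((m/(m - 1)) * f(0) - 1/(m - 1))**2)
```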

Now we can apply a Le Cam style argument [51] for the estimation lower bound. Suppose that there exists an algorithm that estimates the density at (x0,y0)(x_{0},y_{0}) as p^\hat{p}. Alice and Bob can convert this to an algorithm for testing the binary distributions PXY¯P_{\underline{XY}} against P¯XY¯\bar{P}_{\underline{XY}}. Indeed, suppose that (𝐗¯,𝐘¯){\bf(\underline{X},\underline{Y})} is an infinite sequence of i.i.d. random variable pairs according to either PXY¯P_{\underline{XY}} or P¯XY¯\bar{P}_{\underline{XY}}. Using the local randomness (which is implied by the common randomness), Alice and Bob can simulate the sequence of i.i.d. random variables (𝐗,𝐘){\bf(X,Y)} according to either PXYP_{XY} or P¯XY\bar{P}_{XY}, by applying the random transformations PX|X¯P_{X|\underline{X}} and PY|Y¯P_{Y|\underline{Y}} coordinate-wise. Then Alice and Bob can apply the density estimation algorithm to obtain p^\hat{p}. Note that p¯XY(x0,y0)=1\bar{p}_{XY}(x_{0},y_{0})=1 while

pXY(x0,y0)=1+δ[mm1f(0)1m1]2,\displaystyle p_{XY}(x_{0},y_{0})=1+\delta\left[\frac{m}{m-1}f(0)-\frac{1}{m-1}\right]^{2}, (186)

the latter following from (181). Now suppose that Bob declares PXY¯P_{\underline{XY}} if

|p^pXY(x0,y0)||p^1|,\displaystyle|\hat{p}-p_{XY}(x_{0},y_{0})|\leq|\hat{p}-1|, (187)

and P¯XY¯\bar{P}_{\underline{XY}} otherwise. By Chebyshev’s inequality, the error probability (under either hypothesis) is upper bounded by

4δ2[mm1f(0)1m1]4suppXY(β,L,A)𝔼[|p^pXY(x0,y0)|2].\displaystyle 4\delta^{-2}\left[\tfrac{m}{m-1}f(0)-\tfrac{1}{m-1}\right]^{-4}\sup_{p_{XY}\in\mathcal{H}(\beta,L,A)}\mathbb{E}[|\hat{p}-p_{XY}(x_{0},y_{0})|^{2}]. (188)
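For completeness, here is the calculation behind (188). Write $\Delta:=p_{XY}(x_{0},y_{0})-1=\delta\left[\frac{m}{m-1}f(0)-\frac{1}{m-1}\right]^{2}>0$ for the separation between the values of the two candidate densities at $(x_{0},y_{0})$ (cf. (186)). Under either hypothesis, the rule (187) errs only if $\hat{p}$ is at distance at least $\Delta/2$ from the true value $v$, so by Markov's inequality applied to $|\hat{p}-v|^{2}$,

\mathbb{P}[\text{error}]\leq\mathbb{P}\left[|\hat{p}-v|\geq\frac{\Delta}{2}\right]\leq\frac{4}{\Delta^{2}}\mathbb{E}[|\hat{p}-v|^{2}],\qquad v\in\{p_{XY}(x_{0},y_{0}),1\},

and bounding the mean square error by its supremum over $\mathcal{H}(\beta,L,A)$ gives (188).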

On the other hand, from (22) and Theorem 5 we have

D(P𝐘¯ΠP¯𝐘¯Π)H(Π)\displaystyle\frac{D(P_{{\bf\underline{Y}}\Pi}\|\bar{P}_{{\bf\underline{Y}}\Pi})}{H(\Pi)} s(X¯;Y¯)\displaystyle\leq s^{*}(\underline{X};\underline{Y}) (189)
δ2mlnmm+1\displaystyle\leq\frac{\delta^{2}}{m\ln m-m+1} (190)
2δ2md4β+2dlnk\displaystyle\leq\frac{2\delta^{2}}{m\cdot\frac{d}{4\beta+2d}\ln k} (191)
8β+4ddak\displaystyle\leq\frac{8\beta+4d}{dak} (192)

when mm is sufficiently large. However, it is known (from Kraft’s inequality, see e.g. [14]) that the expected length of a prefix code upper bounds the entropy. Thus

k𝔼[|Π|]1log2H(Π)\displaystyle k\geq\mathbb{E}[|\Pi|]\geq\frac{1}{\log 2}H(\Pi) (193)

and therefore

D(P𝐘¯ΠP¯𝐘¯Π)8β+4ddalog2.\displaystyle D(P_{{\bf\underline{Y}}\Pi}\|\bar{P}_{{\bf\underline{Y}}\Pi})\leq\frac{8\beta+4d}{da}\log 2. (194)

Then by Pinsker’s inequality (e.g. [44]),

1𝑑P𝐘¯ΠdP¯𝐘¯Π\displaystyle 1-\int dP_{{\bf\underline{Y}}\Pi}\wedge d\bar{P}_{{\bf\underline{Y}}\Pi} 12logeD(P𝐘¯ΠP¯𝐘¯Π)\displaystyle\leq\sqrt{\frac{1}{2\log e}D(P_{{\bf\underline{Y}}\Pi}\|\bar{P}_{{\bf\underline{Y}}\Pi})} (195)
4β+2ddaln2\displaystyle\leq\sqrt{\frac{4\beta+2d}{da}\ln 2} (196)
=12\displaystyle=\frac{1}{2} (197)

where the last line follows from our choice $a=\frac{16\beta+8d}{d}\ln 2$. However, for any test the sum of the two error probabilities is at least $\int dP_{{\bf\underline{Y}}\Pi}\wedge d\bar{P}_{{\bf\underline{Y}}\Pi}$, so the quantity in (188) is at least $\frac{1}{2}\int dP_{{\bf\underline{Y}}\Pi}\wedge d\bar{P}_{{\bf\underline{Y}}\Pi}\geq\frac{1}{4}$. Therefore we have

suppXY(β,L,A)𝔼[|p^pXY(x0,y0)|2]\displaystyle\quad\sup_{p_{XY}\in\mathcal{H}(\beta,L,A)}\mathbb{E}[|\hat{p}-p_{XY}(x_{0},y_{0})|^{2}]
δ28[mm1f(0)1m1]412\displaystyle\geq\frac{\delta^{2}}{8}\left[\frac{m}{m-1}f(0)-\frac{1}{m-1}\right]^{4}\cdot\frac{1}{2} (198)
=116m2β/d[mm1f(0)1m1]4\displaystyle=\frac{1}{16}m^{-2\beta/d}\left[\frac{m}{m-1}f(0)-\frac{1}{m-1}\right]^{4} (199)

which is lower bounded by 117m2β/df4(0)=f4(0)17(aklnk)2β2β+d\frac{1}{17}m^{-2\beta/d}f^{4}(0)=\frac{f^{4}(0)}{17}\left(\frac{ak}{\ln k}\right)^{-\frac{2\beta}{2\beta+d}} for large enough kk. Since aa and f(0)f(0) depend only on d,β,Ld,\beta,L, this establishes the lower bound in Theorem 1.

IX Interactive Density Estimation Lower Bound

In this section we prove the lower bound in Theorem 2.

IX-A Upper Bounding s(X;Y)s^{*}_{\infty}(X;Y)

The heart of the proof is the following technical result:

Theorem 6.

There exists c>0c>0 small enough such that the following holds: For any PXYP_{XY} which is a distribution on {0,1}2\{0,1\}^{2} corresponding to the following matrix:

(p2(1+δ)pp¯p2δpp¯p2δp¯2+p2δ)\displaystyle\left(\begin{array}[]{cc}p^{2}(1+\delta)&p\bar{p}-p^{2}\delta\\ p\bar{p}-p^{2}\delta&\bar{p}^{2}+p^{2}\delta\end{array}\right) (202)

where p,|δ|[0,c)p,|\delta|\in[0,c) and we used the notation p¯:=1p\bar{p}:=1-p, we have

s(X;Y)c1pδ2.\displaystyle s_{\infty}^{*}(X;Y)\leq c^{-1}p\delta^{2}. (203)

The proof can be found in Appendix B.

IX-B Lower Bounding Interactive NP Estimation Risk

The proof is similar to the one-way case (Section VIII-B). Consider the distribution PXY¯P_{\underline{XY}} on {0,1}2\{0,1\}^{2} as in (171). Let m:=(ak)d2β+dm:=(ak)^{\frac{d}{2\beta+d}} and δ:=mβd\delta:=m^{-\frac{\beta}{d}}, where a=2ln2ca=\frac{2\ln 2}{c} with cc being the absolute constant in Theorem 6. Pick the function ff, and define PXY¯XYP_{\underline{XY}XY} and P¯XY¯XY\bar{P}_{\underline{XY}XY} as before. Note that, as before, p¯XY\bar{p}_{XY} is uniform on [0,1]2d[0,1]^{2d}, while pXY(0,1)2d,βL\|p_{XY}\|_{(0,1)^{2d},\beta}\leq L for m10m\geq 10. pXY(x0,y0)p_{XY}(x_{0},y_{0}) has the same formula (186), and Alice and Bob can convert a (now interactive) density estimation algorithm to an algorithm for testing PXY¯P_{\underline{XY}} against P¯XY¯\bar{P}_{\underline{XY}}. With the same testing rule (187), the error probability under either hypothesis is again upper bounded by (188).

Changes arise in (189), where we shall apply Theorem 6 instead. Note that for the absolute constant cc in Theorem 6, the condition 1m,|δ|<c\frac{1}{m},|\delta|<c is satisfied for sufficiently large kk (hence sufficiently large mm).

D(P𝐘¯ΠP¯𝐘¯Π)H(Π)\displaystyle\frac{D(P_{{\bf\underline{Y}}\Pi}\|\bar{P}_{{\bf\underline{Y}}\Pi})}{H(\Pi)} s(X¯;Y¯)\displaystyle\leq s^{*}_{\infty}(\underline{X};\underline{Y}) (204)
c1δ2m\displaystyle\leq\frac{c^{-1}\delta^{2}}{m} (205)
(cak)1.\displaystyle\leq(cak)^{-1}. (206)

Again using Kraft's inequality and $\mathbb{E}[|\Pi|]\leq k$ to bound $H(\Pi)$, we obtain

D(P𝐘¯ΠP¯𝐘¯Π)log2ca.\displaystyle D(P_{{\bf\underline{Y}}\Pi}\|\bar{P}_{{\bf\underline{Y}}\Pi})\leq\frac{\log 2}{ca}. (207)

Then Pinsker’s inequality yields

1𝑑P𝐘¯ΠdP¯𝐘¯Πln22ca=12\displaystyle 1-\int dP_{\underline{\bf Y}\Pi}\wedge d\bar{P}_{\underline{\bf Y}\Pi}\leq\sqrt{\frac{\ln 2}{2ca}}=\frac{1}{2} (208)

since we selected $a=\frac{2\ln 2}{c}$. Again the quantity in (188) is at least $\frac{1}{2}\int dP_{\underline{\bf Y}\Pi}\wedge d\bar{P}_{\underline{\bf Y}\Pi}\geq\frac{1}{4}$, therefore

suppXY(β,L,A)𝔼[|p^pXY(x0,y0)|2]\displaystyle\quad\sup_{p_{XY}\in\mathcal{H}(\beta,L,A)}\mathbb{E}[|\hat{p}-p_{XY}(x_{0},y_{0})|^{2}]
δ28[mm1f(0)1m1]412\displaystyle\geq\frac{\delta^{2}}{8}\left[\frac{m}{m-1}f(0)-\frac{1}{m-1}\right]^{4}\cdot\frac{1}{2} (209)
=116m2β/d[mm1f(0)1m1]4\displaystyle=\frac{1}{16}m^{-2\beta/d}\left[\frac{m}{m-1}f(0)-\frac{1}{m-1}\right]^{4} (210)
f4(0)17(ak)2β2β+d\displaystyle\geq\frac{f^{4}(0)}{17}(ak)^{-\frac{2\beta}{2\beta+d}} (211)

where the last line holds for sufficiently large kk. Since aa is a universal constant and ff depends on d,β,Ld,\beta,L only, this completes the proof of the interactive lower bound.

X Acknowledgement

The author would like to thank Professor Venkat Anantharam for bringing the reference [36] to the author's attention and for some interesting discussions. This research was supported by the starting grant from the Department of Statistics, University of Illinois, Urbana-Champaign.

Appendix A Proof of Theorem 4

First, assume that i{1,2,,r}2i\in\{1,2,\dots,r\}\setminus 2\mathbb{Z}. By the definitions of PUi|XUi1P_{U_{i}|XU^{i-1}} and PUi|YUi1P_{U_{i}|YU^{i-1}}, we can verify that the following holds (for δ=0\delta=0):

PX|Ui(0)(0|𝟎)=PX(0)PX(1)1jioddαj1+PX(0).\displaystyle P_{X|U^{i}}^{(0)}(0|{\bf 0})=\frac{P_{X}(0)}{P_{X}(1)\prod_{1\leq j\leq i}^{\rm odd}\alpha_{j}^{-1}+P_{X}(0)}. (212)

Indeed, (212) follows by applying induction on the following

PX|Ui(0)(0|𝟎)\displaystyle P^{(0)}_{X|U^{i}}(0|{\bf 0}) =p0p0+p1\displaystyle=\frac{p_{0}}{p_{0}+p_{1}} (213)
=PX|Ui2(0)(0|𝟎)PX|Ui2(0)(0|𝟎)+PX|Ui2(0)(1|𝟎)αi1\displaystyle=\frac{P^{(0)}_{X|U^{i-2}}(0|{\bf 0})}{P^{(0)}_{X|U^{i-2}}(0|{\bf 0})+P^{(0)}_{X|U^{i-2}}(1|{\bf 0})\alpha_{i}^{-1}} (214)

where p0:=PX|Ui1(0)(0|𝟎)PUi|XUi1(0)(0|0,𝟎)p_{0}:=P^{(0)}_{X|U^{i-1}}(0|{\bf 0})P^{(0)}_{U_{i}|XU^{i-1}}(0|0,{\bf 0}), p1:=PX|Ui1(0)(1|𝟎)PUi|XUi1(0)(0|1,𝟎)p_{1}:=P^{(0)}_{X|U^{i-1}}(1|{\bf 0})P^{(0)}_{U_{i}|XU^{i-1}}(0|1,{\bf 0}), and we used PX|Ui1(0)(0|𝟎)=PX|Ui2(0)(0|𝟎)P^{(0)}_{X|U^{i-1}}(0|{\bf 0})=P^{(0)}_{X|U^{i-2}}(0|{\bf 0}) which in turn follows from the factorization

PXYUi1|Ui2(0)\displaystyle P^{(0)}_{XYU_{i-1}|U^{i-2}} =PXY|Ui2(0)PUi1|YUi2(0)\displaystyle=P^{(0)}_{XY|U^{i-2}}P^{(0)}_{U_{i-1}|YU^{i-2}} (215)
=PX|Ui2(0)PY|Ui2(0)PUi1|YUi2(0).\displaystyle=P^{(0)}_{X|U^{i-2}}P^{(0)}_{Y|U^{i-2}}P^{(0)}_{U_{i-1}|YU^{i-2}}. (216)
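As a quick sanity check on the induction, the following sketch compares the closed form (212) with the recursion (213)-(214); the value of $P_{X}(0)$ and the odd-round parameters $\alpha_{i}$ below are arbitrary illustrative values.

```python
from math import prod

# Check of the closed form (212) against the recursion (213)-(214).
px0 = 0.01                            # P_X(0) = 1/m_1, an illustrative value
alphas = {1: 3.0, 3: 2.0, 5: 4.0}     # hypothetical alpha_i for the odd rounds i

b = px0                               # P^(0)_{X|U^i}(0 | all-zero messages), i = 0
for i in (1, 3, 5):
    b = b / (b + (1 - b) / alphas[i])                             # recursion (214)
    closed = px0 / ((1 - px0) * prod(1 / alphas[j] for j in (1, 3, 5) if j <= i) + px0)
    print(i, b, closed)               # the two columns agree at every odd round
```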

Now from (212),

PUi|Ui1(0)(0|𝟎)\displaystyle P^{(0)}_{U_{i}|U^{i-1}}(0|{\bf 0}) =PX|Ui1(0)(0|𝟎)+αi1PX|Ui1(0)(1|𝟎)\displaystyle=P^{(0)}_{X|U^{i-1}}(0|{\bf 0})+\alpha_{i}^{-1}P^{(0)}_{X|U^{i-1}}(1|{\bf 0}) (217)
=PX(1)1jioddαj1+PX(0)PX(1)1ji2oddαj1+PX(0).\displaystyle=\frac{P_{X}(1)\prod_{1\leq j\leq i}^{\rm odd}\alpha_{j}^{-1}+P_{X}(0)}{P_{X}(1)\prod_{1\leq j\leq i-2}^{\rm odd}\alpha_{j}^{-1}+P_{X}(0)}. (218)

Similarly, by switching the roles of XX and YY we have

PUi1|Ui2(0)(0|𝟎)\displaystyle P^{(0)}_{U_{i-1}|U^{i-2}}(0|{\bf 0}) =PY(1)2ji1evenαj1+PY(0)PY(1)2ji3evenαj1+PY(0).\displaystyle=\frac{P_{Y}(1)\prod_{2\leq j\leq i-1}^{\rm even}\alpha_{j}^{-1}+P_{Y}(0)}{P_{Y}(1)\prod_{2\leq j\leq i-3}^{\rm even}\alpha_{j}^{-1}+P_{Y}(0)}. (219)

Therefore,

PUi(0)(𝟎)\displaystyle\quad P_{U^{i}}^{(0)}({\bf 0})
\displaystyle=\prod_{1\leq j\leq i}P_{U_{j}|U^{j-1}}^{(0)}(0|{\bf 0}) (220)
=[PX(1)1jioddαj1+PX(0)][PY(1)2jievenαj1+PY(0)]\displaystyle=\left[P_{X}(1)\prod_{1\leq j\leq i}^{\rm odd}\alpha_{j}^{-1}+P_{X}(0)\right]\left[P_{Y}(1)\prod_{2\leq j\leq i}^{\rm even}\alpha_{j}^{-1}+P_{Y}(0)\right] (221)

for any i=1,,ri=1,\dots,r. We also see from (212) and (221) that for any ii odd,

\displaystyle P^{(0)}_{XU^{i-1}}(0,{\bf 0}) =P_{X}(0)\left[P_{Y}(1)\prod_{2\leq j\leq i-1}^{\rm even}\alpha_{j}^{-1}+P_{Y}(0)\right] (222)
=1m1[(11m2)2ji1evenαj1+1m2]\displaystyle=\frac{1}{m_{1}}\left[(1-\frac{1}{m_{2}})\prod_{2\leq j\leq i-1}^{\rm even}\alpha_{j}^{-1}+\frac{1}{m_{2}}\right] (223)
1.1m12ji1evenαj1\displaystyle\leq\frac{1.1}{m_{1}}\prod_{2\leq j\leq i-1}^{\rm even}\alpha_{j}^{-1} (224)

where the last step follows since 2ji1evenαj110m2\prod_{2\leq j\leq i-1}^{\rm even}\alpha_{j}^{-1}\geq\frac{10}{m_{2}}. Therefore, the claim (56) follows, and (57) is similar.

Next, consider any i{1,,r}i\in\{1,\dots,r\}. Define

δi:=PXY|Ui1=𝟎(0,0)PX|Ui1=𝟎(0)PY|Ui1=𝟎(0)1.\displaystyle\delta_{i}:=\frac{P_{XY|U^{i-1}={\bf 0}}(0,0)}{P_{X|U^{i-1}={\bf 0}}(0)P_{Y|U^{i-1}={\bf 0}}(0)}-1. (225)

Observe that the construction of PUr|XYP_{U^{r}|XY} fulfills the Markov chain conditions (14)-(15), implying that

PXY(δ)(0,0)PXY(δ)(1,1)PXY(δ)(0,1)PXY(δ)(1,0)=PXY|Ui1(δ)(0,0|𝟎)PXY|Ui1(δ)(1,1|𝟎)PXY|Ui1(δ)(0,1|𝟎)PXY|Ui1(δ)(1,0|𝟎).\displaystyle\frac{P^{(\delta)}_{XY}(0,0)P^{(\delta)}_{XY}(1,1)}{P^{(\delta)}_{XY}(0,1)P^{(\delta)}_{XY}(1,0)}=\frac{P^{(\delta)}_{XY|U^{i-1}}(0,0|{\bf 0})P^{(\delta)}_{XY|U^{i-1}}(1,1|{\bf 0})}{P^{(\delta)}_{XY|U^{i-1}}(0,1|{\bf 0})P^{(\delta)}_{XY|U^{i-1}}(1,0|{\bf 0})}. (226)

We therefore have111We use the notation x¯:=1x\bar{x}:=1-x for x[0,1]x\in[0,1].

(1+δ)(1+δ(m11)(m21))(1δm11)(1δm21)=(1+δi)(1+δib(δ)c(δ)b¯(δ)c¯(δ))(1δib(δ)b¯(δ))(1δic(δ)c¯(δ))\displaystyle\frac{(1+\delta)(1+\frac{\delta}{(m_{1}-1)(m_{2}-1)})}{(1-\frac{\delta}{m_{1}-1})(1-\frac{\delta}{m_{2}-1})}=\frac{(1+\delta_{i})(1+\delta_{i}\frac{b^{(\delta)}c^{(\delta)}}{\bar{b}^{(\delta)}\bar{c}^{(\delta)}})}{(1-\delta_{i}\frac{b^{(\delta)}}{\bar{b}^{(\delta)}})(1-\delta_{i}\frac{c^{(\delta)}}{\bar{c}^{(\delta)}})} (227)

where we defined

b(δ)\displaystyle b^{(\delta)} :=PX|Ui1(δ)(0|𝟎);\displaystyle:=P_{X|U^{i-1}}^{(\delta)}(0|{\bf 0}); (228)
c(δ)\displaystyle c^{(\delta)} :=PY|Ui1(δ)(0|𝟎).\displaystyle:=P_{Y|U^{i-1}}^{(\delta)}(0|{\bf 0}). (229)

By continuity, we have

b(δ)\displaystyle b^{(\delta)} =b(0)+o(1);\displaystyle=b^{(0)}+o(1); (230)
c(δ)\displaystyle c^{(\delta)} =c(0)+o(1),\displaystyle=c^{(0)}+o(1), (231)

as $\delta\to 0$. It is also easy to see from (227) that $\delta_{i}=O(\delta)$ (for this proof, only $\delta$ is the variable, and all other quantities, such as $m_{1},m_{2}$ and $(\alpha_{i})$, are treated as constants hidden in the Landau notation). Therefore, (227), (230), and (231) yield

1+(1+1m11)(1+1m21)δ+o(δ)\displaystyle\quad 1+\left(1+\frac{1}{m_{1}-1}\right)\left(1+\frac{1}{m_{2}-1}\right)\delta+o(\delta)
=1+(1+b(δ)b¯(δ))(1+c(δ)c¯(δ))δi+o(δ)\displaystyle=1+\left(1+\frac{b^{(\delta)}}{\bar{b}^{(\delta)}}\right)\left(1+\frac{c^{(\delta)}}{\bar{c}^{(\delta)}}\right)\delta_{i}+o(\delta) (232)
=1+(1+b(0)b¯(0))(1+c(0)c¯(0))δi+o(δ).\displaystyle=1+\left(1+\frac{b^{(0)}}{\bar{b}^{(0)}}\right)\left(1+\frac{c^{(0)}}{\bar{c}^{(0)}}\right)\delta_{i}+o(\delta). (233)

Using the fact that XX and YY are independent under P(0)P^{(0)}, noting (212) and the assumption 1jroddαj110m1\prod_{1\leq j\leq r}^{\rm odd}\alpha_{j}^{-1}\geq\frac{10}{m_{1}}, we have

b(0)=PX|Ui(0)(0|𝟎)1m110m1(11m1)+1m1110,\displaystyle b^{(0)}=P_{X|U^{i^{\prime}}}^{(0)}(0|{\bf 0})\leq\frac{\frac{1}{m_{1}}}{\frac{10}{m_{1}}(1-\frac{1}{m_{1}})+\frac{1}{m_{1}}}\leq\frac{1}{10}, (234)

where ii^{\prime} is the largest odd integer not exceeding ii. Similarly we also have c(0)110c^{(0)}\leq\frac{1}{10}. Consequently, (233) yields

δi2\displaystyle\delta_{i}^{2} δ2((1+1m11)(1+1m21)(1+19)2)2+o(δ2)\displaystyle\geq\delta^{2}\left(\frac{(1+\frac{1}{m_{1}-1})(1+\frac{1}{m_{2}-1})}{(1+\frac{1}{9})^{2}}\right)^{2}+o(\delta^{2}) (235)
0.94δ2+o(δ2).\displaystyle\geq 0.9^{4}\delta^{2}+o(\delta^{2}). (236)

Moreover, let us define

a(δ):=PUi|Ui1(δ)(0|𝟎).\displaystyle a^{(\delta)}:=P^{(\delta)}_{U_{i}|U^{i-1}}(0|{\bf 0}). (237)

In the following paragraph we consider arbitrary i{1,2,,r}2i\in\{1,2,\dots,r\}\setminus 2\mathbb{Z}, and we shall omit the superscripts (δ)(\delta) for a(δ)a^{(\delta)}, b(δ)b^{(\delta)}, c(δ)c^{(\delta)}, unless otherwise noted. Then

I(Ui;Y|Ui1=𝟎)\displaystyle\quad I(U_{i};Y|U^{i-1}={\bf 0})
=aD(PY|Ui=𝟎PY|Ui1=𝟎)\displaystyle=aD(P_{Y|U^{i}={\bf 0}}\|P_{Y|U^{i-1}={\bf 0}})
+a¯D(PY|Ui=1,Ui1=𝟎PY|Ui1=𝟎)\displaystyle\quad+\bar{a}D(P_{Y|U_{i}=1,U^{i-1}={\bf 0}}\|P_{Y|U^{i-1}={\bf 0}}) (238)

We can verify that PX|Ui=𝟎(0)=bb+αi1b¯P_{X|U^{i}={\bf 0}}(0)=\frac{b}{b+\alpha_{i}^{-1}\bar{b}}. Therefore,

PY|Ui=𝟎(0)\displaystyle P_{Y|U^{i}={\bf 0}}(0) =bb+αi1b¯c(1+δi1)\displaystyle=\frac{b}{b+\alpha_{i}^{-1}\bar{b}}\cdot c(1+\delta_{i-1})
+αi1b¯b+αi1b¯b¯cδi1bcb¯\displaystyle\quad+\frac{\alpha_{i}^{-1}\bar{b}}{b+\alpha_{i}^{-1}\bar{b}}\cdot\frac{\bar{b}c-\delta_{i-1}bc}{\bar{b}} (239)
=c+bc(1αi1)b+αi1b¯δi1.\displaystyle=c+\frac{bc(1-\alpha_{i}^{-1})}{b+\alpha_{i}^{-1}\bar{b}}\delta_{i-1}. (240)

Therefore as δ0\delta\to 0,

D(PY|Ui=𝟎PY|Ui1=𝟎)\displaystyle D(P_{Y|U^{i}={\bf 0}}\|P_{Y|U^{i-1}={\bf 0}}) =d(PY|Ui=𝟎(0)c)\displaystyle=d\left(P_{Y|U^{i}={\bf 0}}(0)\|c\right) (241)
=12c(b(αi1)αib+b¯δi1)2+o(δ2)\displaystyle=\frac{1}{2}c\left(\frac{b(\alpha_{i}-1)}{\alpha_{i}b+\bar{b}}\delta_{i-1}\right)^{2}+o(\delta^{2}) (242)
12c(b(αi1)0.92δ)2+o(δ2)\displaystyle\geq\frac{1}{2}c\left(b(\alpha_{i}-1)0.9^{2}\delta\right)^{2}+o(\delta^{2}) (243)
0.942cb2(αi1)δ2+o(δ2)\displaystyle\geq\frac{0.9^{4}}{2}cb^{2}(\alpha_{i}-1)\delta^{2}+o(\delta^{2}) (244)

where d(pq):=plogpq+(1p)log1p1qd(p\|q):=p\log\frac{p}{q}+(1-p)\log\frac{1-p}{1-q} denotes the binary divergence function, and recall that we assumed the natural base of logarithms. On the other hand, PY|Ui=1,Ui1=𝟎(0)=PY|X=1,Ui1=𝟎(0)=cδi1bc1bP_{Y|U_{i}=1,U^{i-1}={\bf 0}}(0)=P_{Y|X=1,U^{i-1}={\bf 0}}(0)=c-\frac{\delta_{i-1}bc}{1-b}. Therefore

D(PY|Ui=1,Ui1=𝟎PY|Ui1=𝟎)\displaystyle D(P_{Y|U_{i}=1,U^{i-1}={\bf 0}}\|P_{Y|U^{i-1}={\bf 0}}) =d(cδi1bc1bc)\displaystyle=d\left(c-\frac{\delta_{i-1}bc}{1-b}\|c\right) (245)
=12c(δi1b1b)2+o(δ2)\displaystyle=\frac{1}{2}c\left(\frac{\delta_{i-1}b}{1-b}\right)^{2}+o(\delta^{2}) (246)
0.942cb2δ2+o(δ2).\displaystyle\geq\frac{0.9^{4}}{2}cb^{2}\delta^{2}+o(\delta^{2}). (247)

Turning back to (238), we obtain

I(Ui;Y|Ui1=𝟎)\displaystyle\quad I(U_{i};Y|U^{i-1}={\bf 0})
=0.942[ab2c(αi1)2δ2+(1a)cb2δ2]+o(δ2)\displaystyle=\frac{0.9^{4}}{2}\left[ab^{2}c(\alpha_{i}-1)^{2}\delta^{2}+(1-a)cb^{2}\delta^{2}\right]+o(\delta^{2}) (248)
0.952(αi1(αi1)2+1αi1)b2cδ2+o(δ2)\displaystyle\geq\frac{0.9^{5}}{2}\left(\alpha_{i}^{-1}(\alpha_{i}-1)^{2}+1-\alpha_{i}^{-1}\right)b^{2}c\delta^{2}+o(\delta^{2}) (249)
0.952(αi1)b2cδ2+o(δ2)\displaystyle\geq\frac{0.9^{5}}{2}(\alpha_{i}-1)b^{2}c\delta^{2}+o(\delta^{2}) (250)

where (249) follows since (218) implies

a(δ)\displaystyle a^{(\delta)} =a(0)+o(1)\displaystyle=a^{(0)}+o(1) (251)
=(11m1)1jioddαj1+1m1(11m1)1ji2oddαj1+1m1+o(1)\displaystyle=\frac{(1-\frac{1}{m_{1}})\prod_{1\leq j\leq i}^{\rm odd}\alpha_{j}^{-1}+\frac{1}{m_{1}}}{(1-\frac{1}{m_{1}})\prod_{1\leq j\leq i-2}^{\rm odd}\alpha_{j}^{-1}+\frac{1}{m_{1}}}+o(1) (252)
(11m1)1jioddαj1(11m1)1ji2oddαj1+o(1)\displaystyle\geq\frac{(1-\frac{1}{m_{1}})\prod_{1\leq j\leq i}^{\rm odd}\alpha_{j}^{-1}}{(1-\frac{1}{m_{1}})\prod_{1\leq j\leq i-2}^{\rm odd}\alpha_{j}^{-1}}+o(1) (253)
=αi1+o(1)\displaystyle=\alpha_{i}^{-1}+o(1) (254)

and

1a(δ)\displaystyle 1-a^{(\delta)} =1a(0)+o(1)\displaystyle=1-a^{(0)}+o(1) (255)
=(11m1)(1αi1)1ji2oddαj1(11m1)1ji2oddαj1+1m1+o(1)\displaystyle=\frac{(1-\frac{1}{m_{1}})(1-\alpha_{i}^{-1})\prod_{1\leq j\leq i-2}^{\rm odd}\alpha_{j}^{-1}}{(1-\frac{1}{m_{1}})\prod_{1\leq j\leq i-2}^{\rm odd}\alpha_{j}^{-1}+\frac{1}{m_{1}}}+o(1) (256)
(11m1)(1αi1)10m1(11m1)10m1+1m1+o(1)\displaystyle\geq\frac{(1-\frac{1}{m_{1}})(1-\alpha_{i}^{-1})\frac{10}{m_{1}}}{(1-\frac{1}{m_{1}})\frac{10}{m_{1}}+\frac{1}{m_{1}}}+o(1) (257)
0.9(1αi1)+o(1).\displaystyle\geq 0.9(1-\alpha_{i}^{-1})+o(1). (258)

Moreover, by (212),

b(δ)\displaystyle b^{(\delta)} =b(0)+o(1)\displaystyle=b^{(0)}+o(1) (259)
=1m1(11m1)1ji1oddαj1+1m1+o(1)\displaystyle=\frac{\frac{1}{m_{1}}}{(1-\frac{1}{m_{1}})\prod_{1\leq j\leq i-1}^{\rm odd}\alpha_{j}^{-1}+\frac{1}{m_{1}}}+o(1) (260)
1m1(11m1+110)1ji1oddαj1+o(1)\displaystyle\geq\frac{\frac{1}{m_{1}}}{(1-\frac{1}{m_{1}}+\frac{1}{10})\prod_{1\leq j\leq i-1}^{\rm odd}\alpha_{j}^{-1}}+o(1) (261)
0.9m11ji1oddαj+o(1).\displaystyle\geq\frac{0.9}{m_{1}}\prod_{1\leq j\leq i-1}^{\rm odd}\alpha_{j}+o(1). (262)

Similarly,

c(δ)0.9m21ji1evenαj+o(1).\displaystyle c^{(\delta)}\geq\frac{0.9}{m_{2}}\prod_{1\leq j\leq i-1}^{\rm even}\alpha_{j}+o(1). (263)

Therefore by (250) and (221),

I(Ui;Y|Ui1)\displaystyle I(U_{i};Y|U^{i-1}) =I(Ui;Y|Ui1=𝟎)PUi1(𝟎)\displaystyle=I(U_{i};Y|U^{i-1}={\bf 0})P_{U^{i-1}}({\bf 0}) (264)
0.952(αi1)b2cδ21ji1αj1+o(δ2)\displaystyle\geq\frac{0.9^{5}}{2}(\alpha_{i}-1)b^{2}c\delta^{2}\prod_{1\leq j\leq i-1}\alpha_{j}^{-1}+o(\delta^{2}) (265)
0.98δ22m12m2(αi1)1ji1oddαj+o(δ2)\displaystyle\geq\frac{0.9^{8}\delta^{2}}{2m_{1}^{2}m_{2}}(\alpha_{i}-1)\prod_{1\leq j\leq i-1}^{\rm odd}\alpha_{j}+o(\delta^{2}) (266)
=0.98δ22m12m2(1jioddαj1ji2oddαj)+o(δ2)\displaystyle=\frac{0.9^{8}\delta^{2}}{2m_{1}^{2}m_{2}}\left(\prod_{1\leq j\leq i}^{\rm odd}\alpha_{j}-\prod_{1\leq j\leq i-2}^{\rm odd}\alpha_{j}\right)+o(\delta^{2}) (267)

and hence

1iroddI(Ui;Y|Ui1)\displaystyle\sum_{1\leq i\leq r}^{\rm odd}I(U_{i};Y|U^{i-1}) 0.98δ22m12m21jroddαj+o(δ2),\displaystyle\geq\frac{0.9^{8}\delta^{2}}{2m_{1}^{2}m_{2}}\prod_{1\leq j\leq r}^{\rm odd}\alpha_{j}+o(\delta^{2}), (268)

establishing the claim (58) of the theorem. The proof of (59) is similar.

Appendix B Proof of Theorem 6

We can choose the natural base of logarithms in this proof. Choose 𝐔=(U1,U2,,Ur){\bf U}=(U_{1},U_{2},\dots,U_{r}) satisfying the Markov chain conditions (14)-(15) and so that

s(X;Y)\displaystyle s_{\infty}^{*}(X;Y) 2I(X;Y)I(X;Y|𝐔)I(𝐔;X,Y)\displaystyle\leq 2\cdot\frac{I(X;Y)-I(X;Y|\bf U)}{I({\bf U};X,Y)} (269)
4I(X;Y)I(X;Y|𝐔)I(𝐔;X)+I(𝐔;Y)\displaystyle\leq 4\cdot\frac{I(X;Y)-I(X;Y|\bf U)}{I({\bf U};X)+I({\bf U};Y)} (270)

which is possible by the definition of s(X;Y)s_{\infty}^{*}(X;Y).

Given α,β[0,1]\alpha,\beta\in[0,1], define by Pα,βP^{\alpha,\beta} the unique distribution222Alternatively, Pα,βP^{\alpha,\beta} equals the II-projection argminQXYD(QXYPXY)\operatorname*{arg\,min}_{Q_{XY}}D(Q_{XY}\|P_{XY}) under the constraints QX=[α,α¯]Q_{X}=[\alpha,\bar{\alpha}] and QY=[β,β¯]Q_{Y}=[\beta,\bar{\beta}] [15, Corollary 3.3]. such that

Pα,β(x,y)=PXY(x,y)f(x)g(y)\displaystyle P^{\alpha,\beta}(x,y)=P_{XY}(x,y)f(x)g(y) (271)

for some functions ff and gg, and such that the marginals are Pα:=[α,α¯]P^{\alpha}:=[\alpha,\bar{\alpha}] and Pβ:=[β,β¯]P^{\beta}:=[\beta,\bar{\beta}]. For the existence of Pα,βP^{\alpha,\beta}, see e.g. [23, 40]. Define I(α,β)I(\alpha,\beta) as the mutual information of (X,Y)(X,Y) under Pα,βP^{\alpha,\beta}. Define λ=λ(α,β)\lambda=\lambda(\alpha,\beta) as the number such that Pα,βP^{\alpha,\beta} is the matrix

\displaystyle\left(\begin{array}[]{cc}\alpha\beta+\lambda&\alpha\bar{\beta}-\lambda\\ \bar{\alpha}\beta-\lambda&\bar{\alpha}\bar{\beta}+\lambda\end{array}\right). (274)

Given any 𝐮\bf u, let α𝐮[0,1]\alpha_{\bf u}\in[0,1] be such that PX|𝐔=𝐮=[α𝐮,α¯𝐮]P_{X|{\bf U=u}}=[\alpha_{\bf u},\bar{\alpha}_{\bf u}]. Define β𝐮\beta_{\bf u} similarly but for PY|𝐔=𝐮P_{Y|{\bf U=u}}. With these notations, note that

𝔼[α𝐔]=𝔼[β𝐔]=p;\displaystyle\mathbb{E}[\alpha_{\bf U}]=\mathbb{E}[\beta_{\bf U}]=p; (275)

and

I(X;Y)I(X;Y|𝐔)=I(p,p)𝔼[I(α𝐔,β𝐔)];\displaystyle I(X;Y)-I(X;Y|{\bf U})=I(p,p)-\mathbb{E}[I(\alpha_{\bf U},\beta_{\bf U})]; (276)
I(𝐔;X)+I(𝐔;Y)=𝔼[d(α𝐔p)+d(β𝐔p)]\displaystyle I({\bf U};X)+I({\bf U};Y)=\mathbb{E}[d(\alpha_{\bf U}\|p)+d(\beta_{\bf U}\|p)] (277)

where we recall that d()d(\cdot\|\cdot) denotes the binary divergence function. Define

ψ(α,β):=d(αp)+d(βp).\displaystyle\psi(\alpha,\beta):=d(\alpha\|p)+d(\beta\|p). (278)

Then note that ψ(α,β)\psi(\alpha,\beta) is a smooth nonnegative function on [0,1]2[0,1]^{2} with vanishing value and first derivatives at (p,p)(p,p). Also, define

ϕ(α,β):=\displaystyle\quad\phi(\alpha,\beta):=
I(p,p)I(α,β)+Iα(p,p)(αp)+Iβ(p,p)(βp)\displaystyle I(p,p)-I(\alpha,\beta)+I_{\alpha}(p,p)(\alpha-p)+I_{\beta}(p,p)(\beta-p) (279)

where Iα(p,p):=αI(α,β)|(p,p)I_{\alpha}(p,p):=\left.\frac{\partial}{\partial\alpha}I(\alpha,\beta)\right|_{(p,p)}. Then ϕ\phi is also a smooth function on [0,1]2[0,1]^{2} with vanishing value and first derivatives at (p,p)(p,p). Moreover, due to (275), we have

𝔼[ϕ(α𝐔,β𝐔)]=I(p,p)𝔼[I(α𝐔,β𝐔)].\displaystyle\mathbb{E}[\phi(\alpha_{\bf U},\beta_{\bf U})]=I(p,p)-\mathbb{E}[I(\alpha_{\bf U},\beta_{\bf U})]. (280)

Thus to prove the theorem it suffices to show the existence of sufficiently small c>0c>0, such that for any p,|δ|(0,c)p,|\delta|\in(0,c), there is

supα,βϕ(α,β)ψ(α,β)c1pδ2\displaystyle\sup_{\alpha,\beta}\frac{\phi(\alpha,\beta)}{\psi(\alpha,\beta)}\leq c^{-1}p\delta^{2} (281)

where the sup is over α,β(0,1)\alpha,\beta\in(0,1).

  • Case 1: 0.1p<α,β<10p0.1p<\alpha,\beta<10p.
    Since 2α2D(αp)=[1α+11α]1α110p\frac{\partial^{2}}{\partial\alpha^{2}}D(\alpha\|p)=\left[\frac{1}{\alpha}+\frac{1}{1-\alpha}\right]\geq\frac{1}{\alpha}\geq\frac{1}{10p} for α[0,10p]\alpha\in[0,10p], we have

    ψ(α,β)120p[(αp)2+(βp)2]\displaystyle\psi(\alpha,\beta)\geq\frac{1}{20p}[(\alpha-p)^{2}+(\beta-p)^{2}] (282)

    for (α,β)[0,10p]2(\alpha,\beta)\in[0,10p]^{2}. Now if we can show that

    sup(α,β)[0,10p]22ϕ(α,β)\displaystyle\sup_{(\alpha,\beta)\in[0,10p]^{2}}\|\partial^{2}\phi(\alpha,\beta)\| =sup(α,β)[0,10p]22I(α,β)\displaystyle=\sup_{(\alpha,\beta)\in[0,10p]^{2}}\|\partial^{2}I(\alpha,\beta)\| (283)
    δ2,\displaystyle\lesssim\delta^{2}, (284)

    we will obtain sup(α,β)[0,10p]2ϕ(α,β)ψ(α,β)pδ2\sup_{(\alpha,\beta)\in[0,10p]^{2}}\frac{\phi(\alpha,\beta)}{\psi(\alpha,\beta)}\lesssim p\delta^{2} which matches (281). Here and below, xyx\lesssim y means that there is an absolute constant C>0C>0 such that xCyx\leq Cy when cc in the theorem statement (and hence pp and |δ||\delta|) is sufficiently small.

    Before explicitly computing $\partial^{2}\phi(\alpha,\beta)$, we give some intuition for why we should expect (284) to be true. For fixed $\alpha,\beta,p$, we will show that

    I(α,β)=I~(α,β)+o(δ2)\displaystyle I(\alpha,\beta)=\tilde{I}(\alpha,\beta)+o(\delta^{2}) (285)

    as δ0\delta\to 0, where we defined

    I~(α,β):=δ22p¯4αα¯ββ¯.\displaystyle\tilde{I}(\alpha,\beta):=\frac{\delta^{2}}{2\bar{p}^{4}}\alpha\bar{\alpha}\beta\bar{\beta}. (286)

    If the difference between I(α,β)I(\alpha,\beta) and I~(α,β)\tilde{I}(\alpha,\beta) could be neglected, then (284) should hold. To see (285), for given α,β(0,1)\alpha,\beta\in(0,1), note that (271) implies,

    (1+λαβ)(1+λα¯β¯)(1λαβ¯)(1λα¯β)=(1+δ)(1+δp2p¯2)(1δpp¯)2.\displaystyle\frac{(1+\frac{\lambda}{\alpha\beta})(1+\frac{\lambda}{\bar{\alpha}\bar{\beta}})}{(1-\frac{\lambda}{\alpha\bar{\beta}})(1-\frac{\lambda}{\bar{\alpha}\beta})}=\frac{(1+\delta)(1+\frac{\delta p^{2}}{\bar{p}^{2}})}{(1-\frac{\delta p}{\bar{p}})^{2}}. (287)

    Under the assumption δ0\delta\to 0, the above linearizes to

    λαα¯ββ¯=δp¯2+o(δ).\displaystyle\frac{\lambda}{\alpha\bar{\alpha}\beta\bar{\beta}}=\frac{\delta}{\bar{p}^{2}}+o(\delta). (288)
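    Explicitly, since $\frac{1}{\alpha\beta}+\frac{1}{\alpha\bar{\beta}}+\frac{1}{\bar{\alpha}\beta}+\frac{1}{\bar{\alpha}\bar{\beta}}=\frac{1}{\alpha\bar{\alpha}\beta\bar{\beta}}$, the left side of (287) equals $1+\frac{\lambda}{\alpha\bar{\alpha}\beta\bar{\beta}}+O(\lambda^{2})$, while the right side equals $1+\delta\left(1+\frac{p}{\bar{p}}\right)^{2}+O(\delta^{2})=1+\frac{\delta}{\bar{p}^{2}}+O(\delta^{2})$; matching the first-order terms gives (288).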

    Moreover, note that

    Dχ2(Pα,βPα×Pβ)\displaystyle D_{\chi^{2}}(P^{\alpha,\beta}\|P^{\alpha}\times P^{\beta}) =λ22αα¯ββ¯\displaystyle=\frac{\lambda^{2}}{2\alpha\bar{\alpha}\beta\bar{\beta}} (289)
    =δ22p¯4αα¯ββ¯+o(δ2)\displaystyle=\frac{\delta^{2}}{2\bar{p}^{4}}\alpha\bar{\alpha}\beta\bar{\beta}+o(\delta^{2}) (290)

    where the last step follows by comparing with (288). Since $I(\alpha,\beta)/D_{\chi^{2}}(P^{\alpha,\beta}\|P^{\alpha}\times P^{\beta})\to 1$ as $\delta\to 0$, we see that (285) holds. Of course, (285) does not by itself establish (284), since approximation of function values does not in general imply approximation of derivatives. However, we shall next compute the derivatives explicitly to give a rigorous proof, and the above observations serve as useful guides.

    First, note that

    I(α,β)α\displaystyle\frac{\partial I(\alpha,\beta)}{\partial\alpha}
    =x,y{0,1}(αPα,β(x,y))lnPα,β(x,y)Pα(x)Pβ(x)\displaystyle=\sum_{x,y\in\{0,1\}}\left(\frac{\partial}{\partial\alpha}P^{\alpha,\beta}(x,y)\right)\ln\frac{P^{\alpha,\beta}(x,y)}{P^{\alpha}(x)P^{\beta}(x)} (291)
    =(β+λα)ln(1+λαβ)+(β¯λα)ln(1λαβ¯)\displaystyle=(\beta+\lambda_{\alpha})\ln(1+\frac{\lambda}{\alpha\beta})+(\bar{\beta}-\lambda_{\alpha})\ln(1-\frac{\lambda}{\alpha\bar{\beta}})
    (βλα)ln(1λα¯β)+(β¯+λα)ln(1+λα¯β¯).\displaystyle\quad(-\beta-\lambda_{\alpha})\ln(1-\frac{\lambda}{\bar{\alpha}\beta})+(-\bar{\beta}+\lambda_{\alpha})\ln(1+\frac{\lambda}{\bar{\alpha}\bar{\beta}}). (292)

    where λα:=αλ\lambda_{\alpha}:=\frac{\partial}{\partial\alpha}\lambda. Next, we express the first and second derivatives, λα\lambda_{\alpha}, λβ\lambda_{\beta} and λα,β\lambda_{\alpha,\beta}, in terms of λ\lambda. Differentiating the logarithm of (287) in β\beta yields

    λβ[1αβ+λ+1α¯β¯+λ+1αβ¯λ+1α¯βλ]\displaystyle\quad\lambda_{\beta}\left[\frac{1}{\alpha\beta+\lambda}+\frac{1}{\bar{\alpha}\bar{\beta}+\lambda}+\frac{1}{\alpha\bar{\beta}-\lambda}+\frac{1}{\bar{\alpha}\beta-\lambda}\right]
    =λ[β1αβ+λβ¯1α¯β¯+λβ¯1αβ¯λ+β1α¯βλ].\displaystyle=\lambda\left[\frac{\beta^{-1}}{\alpha\beta+\lambda}-\frac{\bar{\beta}^{-1}}{\bar{\alpha}\bar{\beta}+\lambda}-\frac{\bar{\beta}^{-1}}{\alpha\bar{\beta}-\lambda}+\frac{\beta^{-1}}{\bar{\alpha}\beta-\lambda}\right]. (293)

    In the rest of the proof the notation f(t)=O(t)f(t)=O(t) means |f(t)||t||f(t)|\lesssim|t| (recall the definition of \lesssim in (284)), and f(t)=Θ(t)f(t)=\Theta(t) if 1f(t)/t11\lesssim f(t)/t\lesssim 1. Note that for 0.1p<α,β<10p0.1p<\alpha,\beta<10p we have

    λαβ=Θ(δ),\displaystyle\frac{\lambda}{\alpha\beta}=\Theta(\delta), (294)

    since the right side of (287) clearly equals 1+Θ(δ)1+\Theta(\delta). Then by (293),

    λβ[1αα¯ββ¯+O(λα2β2)]=λ[β¯βαα¯β2β¯2+O(λα2β3)]\displaystyle\lambda_{\beta}\left[\frac{1}{\alpha\bar{\alpha}\beta\bar{\beta}}+O(\frac{\lambda}{\alpha^{2}\beta^{2}})\right]=\lambda\left[\frac{\bar{\beta}-\beta}{\alpha\bar{\alpha}\beta^{2}\bar{\beta}^{2}}+O(\frac{\lambda}{\alpha^{2}\beta^{3}})\right] (295)

    and hence,

    λβ=λβ¯βββ¯(1+O(λαβ))=O(λβ).\displaystyle\lambda_{\beta}=\lambda\cdot\frac{\bar{\beta}-\beta}{\beta\bar{\beta}}\left(1+O(\frac{\lambda}{\alpha\beta})\right)=O(\frac{\lambda}{\beta}). (296)

    Expression of λα\lambda_{\alpha} can be found similarly. Moreover, differentiating (293) we get

    λα,β[1αβ+λ+1α¯β¯+λ+1αβ¯λ+1α¯βλ]\displaystyle\lambda_{\alpha,\beta}\left[\tfrac{1}{\alpha\beta+\lambda}+\tfrac{1}{\bar{\alpha}\bar{\beta}+\lambda}+\tfrac{1}{\alpha\bar{\beta}-\lambda}+\tfrac{1}{\bar{\alpha}\beta-\lambda}\right]
    +λαλβ[1(αβ+λ)21(α¯β¯+λ)2+1(αβ¯λ)2+1(α¯βλ)2]\displaystyle+\lambda_{\alpha}\lambda_{\beta}\left[-\tfrac{1}{(\alpha\beta+\lambda)^{2}}-\tfrac{1}{(\bar{\alpha}\bar{\beta}+\lambda)^{2}}+\tfrac{1}{(\alpha\bar{\beta}-\lambda)^{2}}+\tfrac{1}{(\bar{\alpha}\beta-\lambda)^{2}}\right]
    +λα[α(αβ+λ)2+α¯(α¯β¯+λ)2+α(αβ¯λ)2α¯(α¯βλ)2]\displaystyle+\lambda_{\alpha}\left[-\tfrac{\alpha}{(\alpha\beta+\lambda)^{2}}+\tfrac{\bar{\alpha}}{(\bar{\alpha}\bar{\beta}+\lambda)^{2}}+\tfrac{\alpha}{(\alpha\bar{\beta}-\lambda)^{2}}-\tfrac{\bar{\alpha}}{(\bar{\alpha}\beta-\lambda)^{2}}\right]
    +λβ[β(αβ+λ)2+β¯(α¯β¯+λ)2+β(α¯βλ)2β¯(αβ¯λ)2]\displaystyle+\lambda_{\beta}\left[-\tfrac{\beta}{(\alpha\beta+\lambda)^{2}}+\tfrac{\bar{\beta}}{(\bar{\alpha}\bar{\beta}+\lambda)^{2}}+\tfrac{\beta}{(\bar{\alpha}\beta-\lambda)^{2}}-\tfrac{\bar{\beta}}{(\alpha\bar{\beta}-\lambda)^{2}}\right]
    +λ[1(αβ+λ)2+1(α¯β¯+λ)21(αβ¯λ)21(α¯βλ)2]=0,\displaystyle+\lambda\left[\tfrac{1}{(\alpha\beta+\lambda)^{2}}+\tfrac{1}{(\bar{\alpha}\bar{\beta}+\lambda)^{2}}-\tfrac{1}{(\alpha\bar{\beta}-\lambda)^{2}}-\tfrac{1}{(\bar{\alpha}\beta-\lambda)^{2}}\right]=0, (297)

    from which we can deduce that

    λα,β=O(λαβ).\displaystyle\lambda_{\alpha,\beta}=O(\frac{\lambda}{\alpha\beta}). (298)

    Now, taking the derivative in (292), we obtain

    α,βI(α,β)\displaystyle\quad\partial_{\alpha,\beta}I(\alpha,\beta)
    =λα,βλ1αα¯ββ¯+λαλβ1αα¯ββ¯+λαλββ¯αα¯β2β¯2\displaystyle=\lambda_{\alpha,\beta}\lambda\cdot\frac{1}{\alpha\bar{\alpha}\beta\bar{\beta}}+\lambda_{\alpha}\lambda_{\beta}\cdot\frac{1}{\alpha\bar{\alpha}\beta\bar{\beta}}+\lambda_{\alpha}\lambda\cdot\frac{\beta-\bar{\beta}}{\alpha\bar{\alpha}\beta^{2}\bar{\beta}^{2}}
    +λβλαα¯α2α¯2ββ¯+λ22(α¯α)(β¯β)α2α¯2β2β¯2\displaystyle\quad+\lambda_{\beta}\lambda\cdot\frac{\alpha-\bar{\alpha}}{\alpha^{2}\bar{\alpha}^{2}\beta\bar{\beta}}+\frac{\lambda^{2}}{2}\cdot\frac{(\bar{\alpha}-\alpha)(\bar{\beta}-\beta)}{\alpha^{2}\bar{\alpha}^{2}\beta^{2}\bar{\beta}^{2}}
    +O(λ3α3β3).\displaystyle\quad+O\left(\frac{\lambda^{3}}{\alpha^{3}\beta^{3}}\right). (299)

    In deriving (299), we applied the Taylor expansions of xln(1+x)x\mapsto\ln(1+x) and x11+xx\mapsto\frac{1}{1+x}. Plugging (296) and (298) into (299), we obtain

    |α,βI(α,β)|=O((λαβ)2)=O(δ2).\displaystyle|\partial_{\alpha,\beta}I(\alpha,\beta)|=O(\left(\frac{\lambda}{\alpha\beta}\right)^{2})=O(\delta^{2}). (300)

    Next, we control |α,αI(α,β)||\partial_{\alpha,\alpha}I(\alpha,\beta)|. Similarly to (293), we have

    λα[1αβ+λ+1α¯β¯+λ+1α¯βλ+1αβ¯λ]\displaystyle\quad\lambda_{\alpha}\left[\frac{1}{\alpha\beta+\lambda}+\frac{1}{\bar{\alpha}\bar{\beta}+\lambda}+\frac{1}{\bar{\alpha}\beta-\lambda}+\frac{1}{\alpha\bar{\beta}-\lambda}\right]
    =λ[α1αβ+λα¯1α¯β¯+λα¯1α¯βλ+α1αβ¯λ].\displaystyle=\lambda\left[\frac{\alpha^{-1}}{\alpha\beta+\lambda}-\frac{\bar{\alpha}^{-1}}{\bar{\alpha}\bar{\beta}+\lambda}-\frac{\bar{\alpha}^{-1}}{\bar{\alpha}\beta-\lambda}+\frac{\alpha^{-1}}{\alpha\bar{\beta}-\lambda}\right]. (301)

    Further taking the derivative, we obtain

    λα,α[1αβ+λ+1α¯β¯+λ+1α¯βλ+1αβ¯λ]\displaystyle\quad\lambda_{\alpha,\alpha}\left[\frac{1}{\alpha\beta+\lambda}+\frac{1}{\bar{\alpha}\bar{\beta}+\lambda}+\frac{1}{\bar{\alpha}\beta-\lambda}+\frac{1}{\alpha\bar{\beta}-\lambda}\right]
    +λα[2βα1λ(αβ+λ)2+2β¯+α¯1λ(α¯β¯+λ)2\displaystyle+\lambda_{\alpha}\left[\frac{-2\beta-\alpha^{-1}\lambda}{(\alpha\beta+\lambda)^{2}}+\frac{2\bar{\beta}+\bar{\alpha}^{-1}\lambda}{(\bar{\alpha}\bar{\beta}+\lambda)^{2}}\right.
    +2βα¯1λ(α¯βλ)2+2β¯+α1λ(αβ¯λ)2]\displaystyle\quad\quad+\left.\frac{2\beta-\bar{\alpha}^{-1}\lambda}{(\bar{\alpha}\beta-\lambda)^{2}}+\frac{-2\bar{\beta}+\alpha^{-1}\lambda}{(\alpha\bar{\beta}-\lambda)^{2}}\right]
    +λ2[1α2(αβ+λ)2+1α¯2(α¯β¯+λ)2\displaystyle+\lambda^{2}\left[\frac{1}{\alpha^{2}(\alpha\beta+\lambda)^{2}}+\frac{1}{\bar{\alpha}^{2}(\bar{\alpha}\bar{\beta}+\lambda)^{2}}\right.
    1α¯2(α¯βλ)21α2(αβ¯λ)2]\displaystyle\quad\quad\left.-\frac{1}{\bar{\alpha}^{2}(\bar{\alpha}\beta-\lambda)^{2}}-\frac{1}{\alpha^{2}(\alpha\bar{\beta}-\lambda)^{2}}\right]
    +2λ[βα(αβ+λ)2+β¯α¯(α¯β¯+λ)2\displaystyle+2\lambda\left[\frac{\beta}{\alpha(\alpha\beta+\lambda)^{2}}+\frac{\bar{\beta}}{\bar{\alpha}(\bar{\alpha}\bar{\beta}+\lambda)^{2}}\right.
    +βα¯(α¯βλ)2+β¯α(αβ¯λ)2]=0.\displaystyle\quad\quad\left.+\frac{\beta}{\bar{\alpha}(\bar{\alpha}\beta-\lambda)^{2}}+\frac{\bar{\beta}}{\alpha(\alpha\bar{\beta}-\lambda)^{2}}\right]=0. (302)

    Next, we use the assumption that α,β(0.1p,10p)\alpha,\beta\in(0.1p,10p) to simplify (302) as

    λα,αΘ(1p2)λαΘ(1p3)+λ2Θ(1p6)+λΘ(1p4)=0.\displaystyle\lambda_{\alpha,\alpha}\cdot\Theta(\frac{1}{p^{2}})-\lambda_{\alpha}\cdot\Theta(\frac{1}{p^{3}})+\lambda^{2}\cdot\Theta(\frac{1}{p^{6}})+\lambda\cdot\Theta(\frac{1}{p^{4}})=0. (303)

    Since, analogously to (296), we have

    λα=O(λα),\displaystyle\lambda_{\alpha}=O(\frac{\lambda}{\alpha}), (304)

    we see that (303) implies (see the verification given after Case 3)

    λα,α=O(λp2).\displaystyle\lambda_{\alpha,\alpha}=O(\frac{\lambda}{p^{2}}). (305)

    Tighter estimates of λα,α\lambda_{\alpha,\alpha} are possible, but the above will suffice. We now take the derivative of (292) in α\alpha:

    α,αI(α,β)=I1+I2\displaystyle\partial_{\alpha,\alpha}I(\alpha,\beta)=I_{1}+I_{2} (306)

    where

    I1:\displaystyle I_{1}: =λα,α[ln(1+λαβ)ln(1λαβ¯)\displaystyle=\lambda_{\alpha,\alpha}\left[\ln(1+\frac{\lambda}{\alpha\beta})-\ln(1-\frac{\lambda}{\alpha\bar{\beta}})\right.
    ln(1λα¯β)+ln(1+λα¯β¯)]\displaystyle\quad\quad\left.-\ln(1-\frac{\lambda}{\bar{\alpha}\beta})+\ln(1+\frac{\lambda}{\bar{\alpha}\bar{\beta}})\right] (307)

    and

    I2\displaystyle I_{2} :=λα2(1αβ1+λαβ+1αβ¯1λαβ¯+1α¯β1λα¯β+1α¯β¯1+λα¯β¯)\displaystyle:=\lambda_{\alpha}^{2}\left(\frac{\frac{1}{\alpha\beta}}{1+\frac{\lambda}{\alpha\beta}}+\frac{\frac{1}{\alpha\bar{\beta}}}{1-\frac{\lambda}{\alpha\bar{\beta}}}+\frac{\frac{1}{\bar{\alpha}\beta}}{1-\frac{\lambda}{\bar{\alpha}\beta}}+\frac{\frac{1}{\bar{\alpha}\bar{\beta}}}{1+\frac{\lambda}{\bar{\alpha}\bar{\beta}}}\right)
    +λα(1α1+λαβ1α1λαβ¯+1α¯1λα¯β1α¯1+λα¯β¯)\displaystyle+\lambda_{\alpha}\left(\frac{\frac{1}{\alpha}}{1+\frac{\lambda}{\alpha\beta}}-\frac{\frac{1}{\alpha}}{1-\frac{\lambda}{\alpha\bar{\beta}}}+\frac{\frac{1}{\bar{\alpha}}}{1-\frac{\lambda}{\bar{\alpha}\beta}}-\frac{\frac{1}{\bar{\alpha}}}{1+\frac{\lambda}{\bar{\alpha}\bar{\beta}}}\right)
    +λαλ(1α2β1+λαβ+1α2β¯1λαβ¯+1α¯2β1λα¯β+1α¯2β¯1+λα¯β¯)\displaystyle+\lambda_{\alpha}\lambda\left(\frac{-\frac{1}{\alpha^{2}\beta}}{1+\frac{\lambda}{\alpha\beta}}+\frac{-\frac{1}{\alpha^{2}\bar{\beta}}}{1-\frac{\lambda}{\alpha\bar{\beta}}}+\frac{\frac{1}{\bar{\alpha}^{2}\beta}}{1-\frac{\lambda}{\bar{\alpha}\beta}}+\frac{\frac{1}{\bar{\alpha}^{2}\bar{\beta}}}{1+\frac{\lambda}{\bar{\alpha}\bar{\beta}}}\right)
    +λ(1α21+λαβ+1α21λαβ¯+1α¯21λα¯β+1α¯21+λα¯β¯)\displaystyle+\lambda\left(\frac{-\frac{1}{\alpha^{2}}}{1+\frac{\lambda}{\alpha\beta}}+\frac{\frac{1}{\alpha^{2}}}{1-\frac{\lambda}{\alpha\bar{\beta}}}+\frac{\frac{1}{\bar{\alpha}^{2}}}{1-\frac{\lambda}{\bar{\alpha}\beta}}+\frac{-\frac{1}{\bar{\alpha}^{2}}}{1+\frac{\lambda}{\bar{\alpha}\bar{\beta}}}\right) (308)

    We can Taylor expand I1I_{1} using the facts that λα,α=O(λp2)\lambda_{\alpha,\alpha}=O(\frac{\lambda}{p^{2}}) and α,β=Θ(p)\alpha,\beta=\Theta(p) to obtain

    I1=O(λ2p4).\displaystyle I_{1}=O(\frac{\lambda^{2}}{p^{4}}). (309)

    We can Taylor expand I2I_{2} using the fact that λα=O(λp)\lambda_{\alpha}=O(\frac{\lambda}{p}) (analogous to (296)) to obtain

    I2=O(λ2p4).\displaystyle I_{2}=O(\frac{\lambda^{2}}{p^{4}}). (310)

    Thus

    |α,αI(α,β)|=O(λ2p4)=O(δ2).\displaystyle|\partial_{\alpha,\alpha}I(\alpha,\beta)|=O(\frac{\lambda^{2}}{p^{4}})=O(\delta^{2}). (311)

    By symmetry, the same bound holds for |β,βI(α,β)||\partial_{\beta,\beta}I(\alpha,\beta)| as well. Together with (300), this validates (284), and consequently (281), in this case.

  • Case 2: max{α,β}10p\max\{\alpha,\beta\}\geq 10p.
    Without loss of generality assume that αβ\alpha\geq\beta and α10p\alpha\geq 10p. From (292), we have

    αI(p,p)\displaystyle\quad\partial_{\alpha}I(p,p)
    =(p+λα(p,p))ln(1+δ)+(p¯λα(p,p))ln(1δp/p¯)\displaystyle=(p+\lambda_{\alpha}(p,p))\ln(1+\delta)+(\bar{p}-\lambda_{\alpha}(p,p))\ln(1-\delta p/\bar{p})
    +(pλα(p,p))ln(1δp/p¯)\displaystyle\quad+(-p-\lambda_{\alpha}(p,p))\ln(1-\delta p/\bar{p})
    +(p¯+λα(p,p))ln(1+δp2/p¯2).\displaystyle\quad+(-\bar{p}+\lambda_{\alpha}(p,p))\ln(1+\delta p^{2}/\bar{p}^{2}). (312)

    Using (304) with λδp2\lambda\leftarrow\delta p^{2} and αp\alpha\leftarrow p, we obtain λα(p,p)=O(pδ)\lambda_{\alpha}(p,p)=O(p\delta). Thus, Taylor expanding the right-hand side of (312) (the leading-order cancellation is verified after Case 3), we obtain

    αI(p,p)pδ2.\displaystyle\partial_{\alpha}I(p,p)\lesssim p\delta^{2}. (313)

    Then

    ϕ(α,β)\displaystyle\phi(\alpha,\beta) :=I(p,p)I(α,β)+Iα(p,p)(αp)\displaystyle:=I(p,p)-I(\alpha,\beta)+I_{\alpha}(p,p)(\alpha-p)
    +Iβ(p,p)(βp)\displaystyle\quad+I_{\beta}(p,p)(\beta-p) (314)
    p2δ20+2Iα(p,p)(αp)\displaystyle\leq p^{2}\delta^{2}-0+2I_{\alpha}(p,p)(\alpha-p) (315)
    pαδ2\displaystyle\lesssim p\alpha\delta^{2} (316)

    where we used the assumption that αβ\alpha\geq\beta and the fact that Iα(p,p)=Iβ(p,p)I_{\alpha}(p,p)=I_{\beta}(p,p). Now the assumption that α10p\alpha\geq 10p implies

    ψ(α,β)d(αp)α.\displaystyle\psi(\alpha,\beta)\geq d(\alpha\|p)\gtrsim\alpha. (317)

    To see the second inequality in (317), note that

    minα[10p,1]1αd(αp)\displaystyle\min_{\alpha\in[10p,1]}\frac{1}{\alpha}d(\alpha\|p) =minα[10p,1]{lnαp+1ααln1α1p}\displaystyle=\min_{\alpha\in[10p,1]}\left\{\ln\frac{\alpha}{p}+\frac{1-\alpha}{\alpha}\ln\frac{1-\alpha}{1-p}\right\} (318)
    =d(10pp)10p\displaystyle=\frac{d(10p\|p)}{10p} (319)

    where the minimization is easily solved by checking that the derivative is positive for α10p\alpha\geq 10p (a one-line verification is given after Case 3). Finally, combining (317) with (316), we obtain ϕ(α,β)ψ(α,β)δ2p\frac{\phi(\alpha,\beta)}{\psi(\alpha,\beta)}\lesssim\delta^{2}p, as desired.

  • Case 3: min{α,β}0.1p\min\{\alpha,\beta\}\leq 0.1p, max{α,β}<10p\max\{\alpha,\beta\}<10p.
    Assume without loss of generality that α0.1p\alpha\leq 0.1p. In this case, using (313), we have

    ϕ(α,β)\displaystyle\phi(\alpha,\beta) :=I(p,p)I(α,β)+Iα(p,p)(αp)\displaystyle:=I(p,p)-I(\alpha,\beta)+I_{\alpha}(p,p)(\alpha-p)
    +Iβ(p,p)(βp)\displaystyle\quad+I_{\beta}(p,p)(\beta-p) (320)
    p2δ20+Iα(p,p)(α+β2p)\displaystyle\leq p^{2}\delta^{2}-0+I_{\alpha}(p,p)(\alpha+\beta-2p) (321)
    p2δ2+Iα(p,p)18p\displaystyle\leq p^{2}\delta^{2}+I_{\alpha}(p,p)\cdot 18p (322)
    =O(p2δ2).\displaystyle=O(p^{2}\delta^{2}). (323)

    On the other hand,

    ψ(α,β)\displaystyle\psi(\alpha,\beta) d(αp)\displaystyle\geq d(\alpha\|p) (324)
    d(0.1pp)\displaystyle\geq d(0.1p\|p) (325)
    =(0.90.1ln10)p+O(p2)\displaystyle=(0.9-0.1\ln 10)p+O(p^{2}) (326)
    =Θ(p).\displaystyle=\Theta(p). (327)

    Thus we once again obtain ϕ(α,β)ψ(α,β)δ2p\frac{\phi(\alpha,\beta)}{\psi(\alpha,\beta)}\lesssim\delta^{2}p, as desired.
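For the reader's convenience, we conclude this case analysis by recording routine verifications of three claims made above, namely (305), (313), and (319); these are straightforward consequences of the displayed equations and introduce no new ingredients. First, solving (303) for \lambda_{\alpha,\alpha}, substituting (304), and recalling that \lambda/(\alpha\beta)=O(\delta)=O(1) (as in the derivation of (300)), we obtain, loosely,

\displaystyle\lambda_{\alpha,\alpha}=\Theta(p^{2})\cdot\left[\lambda_{\alpha}\cdot\Theta\left(\frac{1}{p^{3}}\right)-\lambda^{2}\cdot\Theta\left(\frac{1}{p^{6}}\right)-\lambda\cdot\Theta\left(\frac{1}{p^{4}}\right)\right]=O\left(\frac{\lambda}{p^{2}}\right)+O\left(\frac{\lambda^{2}}{p^{4}}\right)=O\left(\frac{\lambda}{p^{2}}\right),

which is (305).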
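Second, the cancellation behind (313) can be made explicit. Expanding the four logarithms in (312) to first order in \delta, the terms not involving \lambda_{\alpha}(p,p) contribute

\displaystyle p\cdot\delta+\bar{p}\cdot\left(-\frac{\delta p}{\bar{p}}\right)+(-p)\cdot\left(-\frac{\delta p}{\bar{p}}\right)+(-\bar{p})\cdot\frac{\delta p^{2}}{\bar{p}^{2}}=0,

while the terms involving \lambda_{\alpha}(p,p)=O(p\delta) and the second-order remainders each contribute O(p\delta^{2}), which yields (313).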
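Third, the monotonicity used in (318)–(319) follows from a one-line computation: since \partial_{\alpha}d(\alpha\|p)=\ln\frac{\alpha}{p}-\ln\frac{1-\alpha}{1-p}, we have

\displaystyle\frac{d}{d\alpha}\left[\frac{1}{\alpha}d(\alpha\|p)\right]=\frac{\alpha\,\partial_{\alpha}d(\alpha\|p)-d(\alpha\|p)}{\alpha^{2}}=\frac{1}{\alpha^{2}}\ln\frac{1-p}{1-\alpha}>0\quad\text{for }\alpha>p,

so the minimum in (318) is attained at \alpha=10p, as claimed in (319).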

References

  • [1] Jayadev Acharya, Clement Canonne, Aditya Vikram Singh, and Himanshu Tyagi. Optimal rates for nonparametric density estimation under communication constraints. Advances in Neural Information Processing Systems, 34, 2021.
  • [2] Jayadev Acharya, Clément L Canonne, and Himanshu Tyagi. Inference under information constraints II: Communication constraints and shared randomness. IEEE Transactions on Information Theory, 66(12):7856–7877, 2020.
  • [3] Rudolf Ahlswede and Marat V Burnashev. On minimax estimation in the presence of side information about remote data. Annals of Statistics, pages 141–171, 1990.
  • [4] Rudolf Ahlswede and Imre Csiszár. Hypothesis testing with communication constraints. IEEE Transactions on Information Theory, 32(4), 1986.
  • [5] Rudolf Ahlswede and Peter Gács. Spreading of sets in product spaces and hypercontraction of the Markov operator. The Annals of Probability, 4:925–939, 1976.
  • [6] Venkat Anantharam, Amin Gohari, Sudeep Kamath, and Chandra Nair. On maximal correlation, hypercontractivity, and the data processing inequality studied by Erkip and Cover. arXiv:1304.6133, 2013.
  • [7] Leighton Pate Barnes, Yanjun Han, and Ayfer Özgür. Lower bounds for learning distributions under communication constraints via Fisher information. The Journal of Machine Learning Research, 21(1):9583–9612, 2020.
  • [8] Heather Battey, Jianqing Fan, Han Liu, Junwei Lu, and Ziwei Zhu. Distributed testing and estimation under sparse high dimensional models. Annals of Statistics, 46(3):1352, 2018.
  • [9] Mark Braverman and Anup Rao. Information equals amortized communication. In FOCS, pages 748–757, 2011.
  • [10] Mark Braverman, Ankit Garg, Tengyu Ma, Huy L Nguyen, and David P Woodruff. Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 1011–1020, 2016.
  • [11] T. Tony Cai and Hongji Wei. Distributed Gaussian mean estimation under communication constraints: Optimal rates and communication-efficient algorithms. arXiv:2001.08877, 2020.
  • [12] Amit Chakrabarti and Oded Regev. An optimal lower bound on the communication complexity of gap-Hamming distance. SIAM Journal on Computing, 41(5):1299–1317, 2012.
  • [13] Amit Chakrabarti, Yaoyun Shi, Anthony Wirth, and Andrew Yao. Informational complexity and the direct sum problem for simultaneous message complexity. In Proceedings 42nd IEEE Symposium on Foundations of Computer Science, pages 270–278. IEEE, 2001.
  • [14] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. 2nd Edition, Wiley, July 2006.
  • [15] Imre Csiszár. I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, 3(1):146–158, 1975.
  • [16] John C Duchi, Michael I Jordan, and Martin J Wainwright. Local privacy and statistical minimax rates. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 429–438, 2013.
  • [17] Peter Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2):194–203, 1975.
  • [18] Peter Kairouz et al. Advances and Open Problems in Federated Learning. Foundations and Trends in Machine Learning Series. 2021.
  • [19] Uri Hadar, Jingbo Liu, Yury Polyanskiy, and Ofer Shayevitz. Communication complexity of estimating correlations. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 792–803, 2019.
  • [20] Uri Hadar, Jingbo Liu, Yury Polyanskiy, and Ofer Shayevitz. Error exponents in distributed hypothesis testing of correlations. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 2674–2678, 2019.
  • [21] Yanjun Han, Pritam Mukherjee, Ayfer Ozgur, and Tsachy Weissman. Distributed statistical estimation of high-dimensional and nonparametric distributions. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 506–510. IEEE, 2018.
  • [22] Yanjun Han, Ayfer Özgür, and Tsachy Weissman. Geometric lower bounds for distributed parameter estimation under communication constraints. In Conference On Learning Theory, pages 3163–3188. PMLR, 2018.
  • [23] Charles Hobby and Ronald Pyke. Doubly stochastic operators obtained from positive operators. Pacific Journal of Mathematics, 15(1):153–157, 1965.
  • [24] Piotr Indyk and David Woodruff. Tight lower bounds for the distinct elements problem. In Proceedings of 44th Annual IEEE Symposium on Foundations of Computer Science, pages 283–288, 2003.
  • [25] Amiram Kaspi. Two-way source coding with a fidelity criterion. IEEE Transactions on Information Theory, 31(6):735–740, 1985.
  • [26] Eyal Kushilevitz and Noam Nisan. Communication Complexity. Cambridge University Press, 1997.
  • [27] Jingbo Liu. Information Theory from A Functional Viewpoint. Ph.D. thesis, Princeton University, Princeton, NJ, 2018.
  • [28] Jingbo Liu. Communication complexity of two-party nonparametric global density estimation. In 56th Annual Conference on Information Sciences and Systems (CISS), 2022.
  • [29] Jingbo Liu. Interaction improves two-party nonparametric pointwise density estimation. In 2022 IEEE International Symposium on Information Theory (ISIT), pages 904–909, 2022.
  • [30] Jingbo Liu, Thomas A Courtade, Paul Cuff, and Sergio Verdú. Smoothing Brascamp-Lieb inequalities and strong converses for common randomness generation. In 2016 IEEE International Symposium on Information Theory (ISIT), pages 1043–1047. IEEE, 2016.
  • [31] Jingbo Liu, Paul Cuff, and Sergio Verdú. Key generation with limited interaction. In 2016 IEEE International Symposium on Information Theory (ISIT), pages 2918–2922. IEEE, 2016.
  • [32] Jingbo Liu, Paul Cuff, and Sergio Verdú. Secret key generation with limited interaction. IEEE Transactions on Information Theory, 63(11):7358–7381, 2017.
  • [33] Nan Ma and Prakash Ishwar. Some results on distributed source coding for interactive function computation. IEEE Transactions on Information Theory, 57(9):6180–6195, 2011.
  • [34] Ilan Newman. Private vs. common random bits in communication complexity. Information Processing Letters, 39(2):67–71, 1991.
  • [35] Noam Nisan and Avi Wigderson. Rounds in communication complexity revisited. In Proceedings of the twenty-third annual ACM symposium on Theory of computing, pages 419–429, 1991.
  • [36] Alon Orlitsky. Worst-case interactive communication I: Two messages are almost optimal. IEEE Transactions on Information Theory, 36(5):1111–1126, 1990.
  • [37] Jeff M Phillips and Wai Ming Tai. Near-optimal coresets of kernel density estimates. Discrete & Computational Geometry, 63(4):867–887, 2020.
  • [38] K. R. Sahasranand and Himanshu Tyagi. Extra samples can reduce the communication for independence testing. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 2316–2320, 2018.
  • [39] Igal Sason and Sergio Verdú. f-divergence inequalities. IEEE Transactions on Information Theory, 62(11):5973–6006, 2016.
  • [40] Richard Sinkhorn. Diagonal equivalence to matrices with prescribed row and column sums. The American Mathematical Monthly, 74(4):402–405, 1967.
  • [41] Charles J Stone. Optimal rates of convergence for nonparametric estimators. Annals of Statistics, pages 1348–1360, 1980.
  • [42] Madhu Sudan, Badih Ghazi, Noah Golowich, and Mitali Bafna. Communication-rounds tradeoffs for common randomness and secret key generation. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1861–1871. SIAM, 2019.
  • [43] Madhu Sudan, Himanshu Tyagi, and Shun Watanabe. Communication for generating correlation: A unifying survey. IEEE Transactions on Information Theory, 66(1):5–37, 2019.
  • [44] Alexandre B Tsybakov. Introduction to nonparametric estimation. Springer Science & Business Media, 2008.
  • [45] Paxton Turner, Jingbo Liu, and Philippe Rigollet. Efficient interpolation of density estimators. In International Conference on Artificial Intelligence and Statistics, pages 2503–2511. PMLR, 2021.
  • [46] Paxton Turner, Jingbo Liu, and Philippe Rigollet. A statistical perspective on coreset density estimation. In International Conference on Artificial Intelligence and Statistics, pages 2512–2520. PMLR, 2021.
  • [47] Himanshu Tyagi. Common information and secret key capacity. IEEE Transactions on Information Theory, 59(9):5627–5640, 2013.
  • [48] Yu Xiang and Young-Han Kim. Interactive hypothesis testing against independence. In 2013 IEEE International Symposium on Information Theory, pages 2840–2844, 2013.
  • [49] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):1–19, 2019.
  • [50] Andrew Chi-Chih Yao. Some complexity questions related to distributive computing (preliminary report). In Proceedings of the eleventh annual ACM symposium on Theory of computing, pages 209–213, 1979.
  • [51] Bin Yu. Assouad, Fano, and Le Cam. Festschrift for Lucien Le Cam, pages 423–435, 1996.
  • [52] Yuchen Zhang, John C Duchi, Michael I Jordan, and Martin J Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In NIPS, pages 2328–2336, 2013.
  • [53] Yuancheng Zhu and John Lafferty. Distributed nonparametric regression under communication constraints. In International Conference on Machine Learning, pages 6009–6017, 2018.
  • [54] Yuancheng Zhu and John Lafferty. Quantized minimax estimation over Sobolev ellipsoids. Information and Inference: A Journal of the IMA, 7(1):31–82, 2018.
Jingbo Liu received the B.S. degree in Electrical Engineering from Tsinghua University, Beijing, China, in 2012, and the M.A. and Ph.D. degrees in Electrical Engineering from Princeton University, Princeton, NJ, USA, in 2014 and 2017, respectively. He was a Norbert Wiener Postdoctoral Research Fellow at the MIT Institute for Data, Systems, and Society (IDSS) during 2018-2020. Since 2020, he has been an assistant professor in the Department of Statistics and an affiliate in the Department of Electrical and Computer Engineering at the University of Illinois, Urbana-Champaign, IL, USA. His research interests include information theory, high-dimensional statistics and probability, coding theory, and related fields. His undergraduate thesis received the best undergraduate thesis award at Tsinghua University (2012). He gave a semi-plenary presentation at the 2015 IEEE International Symposium on Information Theory, Hong Kong, China. He was a recipient of the Princeton University Wallace Memorial Honorific Fellowship in 2016. His Ph.D. thesis received the Bede Liu Best Dissertation Award of Princeton University and the Thomas M. Cover Dissertation Award of the IEEE Information Theory Society (2018).