A Few Interactions Improve Distributed Nonparametric Estimation, Optimally
Abstract
Consider the problem of nonparametric estimation of an unknown -Hölder smooth density at a given point, where and are both dimensional. An infinite sequence of i.i.d. samples are generated according to this distribution, and two terminals observe and , respectively. They are allowed to exchange bits either one-way or interactively in order for Bob to estimate the unknown density. We show that the minimax mean square risk is order for one-way protocols and for interactive protocols. The logarithmic improvement is nonexistent in the parametric counterparts, and therefore can be regarded as a consequence of the nonparametric nature of the problem. Moreover, a few rounds of interactions achieve the interactive minimax rate: the number of rounds can grow as slowly as the super-logarithm (i.e., inverse tetration) of . The proof of the upper bound is based on a novel multi-round scheme for estimating the joint distribution of a pair of biased Bernoulli variables, and the lower bound is built on a sharp estimate of a symmetric strong data processing constant for biased Bernoulli variables.
Index Terms:
Density estimation, Communication complexity, Nonparametric statistics, Learning with system constraints, Strong data processing constant
I Introduction
The communication complexity problem was introduced in the seminal paper of Yao [50] (see also [26] for a survey), where two terminals (which we call Alice and Bob) compute a given Boolean function of their local inputs ( and ) by means of exchanging messages. The famous log-rank conjecture provides an estimate of the communication complexity of a general Boolean function, which is still open to date. Meanwhile, communication complexity of certain specific functions can be better understood. For example, the Gap-Hamming problem [24][12] concerns testing against , where denotes the sample correlation and . It was shown in [12] with a geometric argument that the communication complexity (for worst-case deterministic ) is , therefore a one-way protocol where Alice simply sends cannot be improved (up to a multiplicative constant) by an interactive protocol.
Gap-Hamming is closely related to the problem of estimating the joint distribution of a pair of binary or Gaussian random variables (using i.i.d. samples). Indeed, for large we may assume that Alice (resp. Bob) can estimate the marginal distributions of (resp. ) very well, so that the joint distribution is parameterized by only one scalar which is the correlation. An information-theoretic proof of Gap-Hamming was previously provided in [19], building on a converse for correlation estimation for the binary symmetric distribution, and pinned down the exact prefactor in the risk-communication tradeoff. In particular, the result of [19] implies that the naive algorithm where Alice simply sends can be improved by a constant factor in the estimation risk by a more sophisticated scheme using additional samples. For the closely related problem of correlation (distribution) testing, [38] and [48] provided asymptotically tight bounds on the communication complexity under the one-way and interactive protocols when the null hypothesis is the independent distribution (zero correlation), which also implies that the error exponent can be improved by an algorithm using additional samples. The technique of [48] is based on the tensorization of internal and external information ((20) ahead), whereas the bound of [38] uses hypercontractivity. More recently, [20] derived bounds for testing against dependent distributions using optimal transport inequalities.
In this paper, we take the natural step of introducing nonparametric (NP) statistics to Alice and Bob, whereby two parties estimate a nonparametric density by means of sending messages interactively. It will be seen that this problem is closely related to a “sparse” version of the aforementioned Gap-Hamming problem, where interaction does help, in contrast to the usual Gap-Hamming problem.
For concreteness, consider the problem of nonparametric estimation of an unknown -Hölder smooth density at a given point . For simplicity we assume the symmetric case where and are both dimensional. An infinite sequence of i.i.d. samples are generated according to , and Alice and Bob observe and , respectively. After they exchange bits (either one-way or interactively), Bob estimates the unknown density at the given point. We successfully characterize the minimax rate in terms of the communication complexity : it is order for one-way protocols and for interactive protocols.
Notably, allowing interaction strictly improves the estimation risk. Previously, separations between one-way and interactive protocols were known, but in very different contexts: In [32, Corollary 1] (see also [31]), the separation was found in the rate region of common randomness generation from biased binary distributions, using certain convexity arguments, but this only implies a difference in the leading constant, rather than the asymptotic scaling. On the other hand, the example distribution in [42] is based on the pointer-chasing construction of [35], which appears to be a highly artificial distribution designed to entail a separation between the one-way and interactive protocols. Another example where interaction improves zero-error source coding with side information, based on a “bit location” algorithm, was described in [36], and it was shown that two-way communication complexity differs from interactive communication complexity only by constant factors. In contrast, the logarithmic separation in the present paper arises from the nonparametric nature of the problem: If we consider the problem of correlation estimation for Bernoulli pairs with a fixed bias (a parametric problem), the risk will be order , and there will be no separation between one-way and interactive protocols (which is indeed the case in [19]). In contrast, nonparametric estimation is analogous to Bernoulli correlation estimation where the bias changes with (since the optimal bandwidth adapts to ), which gives rise to the separation.
For the risk upper bound, in the one-way setting it is efficient for Alice to just encode the set of ’s such that falls within a neighborhood (computed by the optimal bandwidth for a given ) of the given point . To achieve the optimal rate for interactive protocols, we provide a novel scheme that uses rounds of interactions, where grows as slowly as the super logarithm (i.e. the inverse of tetration) of . With the sequence of we use in Section V-C (and suppose that ), while interactions is barely enough for equal to the number of letters in a short sentence, is more than sufficient for equal to the number of all elementary particles in the entire observable universe. Thus from a practical perspective, is effectively a constant, although it remains an interesting theoretical question whether really diverges (Conjecture 1).
For the lower bound, the proof is based on the symmetric data processing constant introduced in [32]. Previously, the data processing constant has been connected to two-party estimation and hypothesis testing in [19]; the idea was canonized as the following statement: “Information for hypothesis testing locally” is upper bounded by times “Information communicated mutually”. However, is not easy to compute, and previous bounds on are also not tight enough for our purpose. Instead, we first use an idea of simulation of continuous variables to reduce the problem to estimation of Bernoulli distributions, for which is easier to analyze. Then we use some new arguments to bound .
Let us emphasize that this paper concerns density estimation at a given point, rather than estimating the global density function. For the latter problem, it is optimal for Alice to just quantize the samples and send them to Bob, which we show in the companion paper [28]. The mean square error (in norm) of estimating the global density function scales differently from the pointwise case, since the messages cannot be tailored to the given point.
Related work. Besides function computation, distribution estimation and testing, other problems which have been studied in the communication complexity or privacy settings include lossy source coding [25] and common randomness or secret key generation [47][30][32][43]. The key technical tool for interactive two-way communication models, namely the tensorization of internal and external information ((20) ahead), appeared in [25] for lossy compression, [9][33] for function computation, [47][32] for common randomness generation, and [48][19] for parameter estimation.
For one-way communication models, the main tool is a tensorization property related to the strong data processing constant (see (10) ahead), which was first used in [4] in the study of the error exponents in communication constrained hypothesis testing. The hypercontractivity method for single-shot bounds in one-way models was used in [30][27] for common randomness generation and [38] for testing.
In statistics, communication-constrained estimation has received considerable attention recently, starting from [52], which considered a model where distributed samples are compressed and sent to a central estimator. Further works on this communication model include settings of Gaussian location estimation [10][11], parametric estimation [22], nonparametric regression [53], Gaussian noise model [54], statistical inference [2], and nonparametric estimation [21] (with a bug fixed in [7]), [1]. Related problems solved using similar techniques include differential privacy [16] and data summarization [37][46][45]. Communication-efficient construction of test statistics for distributed testing using the divide-and-conquer algorithm is studied in [8]. Generally speaking, these works on statistical minimax rates concern the so-called horizontal partitioning of data sets, where data sets share the same feature space but differ in samples [49][18]. In contrast, vertical distributed or federated learning, where data sets differ in features, has been used by corporations such as those in finance and medical care [49][18]. It is worth mentioning that such a horizontal partitioning model was also introduced in Yao’s paper [50] in the context of function computation under the name “simultaneous message model”, where different parties send messages to a referee instead of to each other. The direct sum property (similar to the tensorization property of internal and external information) of the simultaneous message model was discussed in [13].
Organization of the paper. We review the background on nonparametric estimation, data processing constants and testing independence in Section II. The formulation of the two-party nonparametric estimation problem and the summary of main results are given in Section III. Section IV examines the problem of estimating a parameter in a pair of biased Bernoulli distributions, which will be used as a building block in our nonparametric estimation algorithm. Section V proves some bounds on information exchanges, which will be the key auxiliary results for the proof of upper bound for Bernoulli estimation in Section VI, and for nonparametric estimation in Section VII. Finally, lower bounds are proved in Section VIII in the one-way case and in Section IX in the interactive case.
II Preliminaries
II-A Notation
We use capital letters for probability measures and lower-case letters for the density functions. We use the abbreviations and . We use boldface letters to denote vectors, for example . Unless otherwise specified, the base of logarithms can be arbitrary but remains consistent throughout equations. The precise meaning of the Landau notations, such as , will be explained in each section or in the proofs of specific theorems. We use to denote summing over . For the vector representation of a binary probability distribution, we use the convention that . For the matrix representation of a joint distribution of a pair of binary random variables, we use the convention that . For , we use the shorthand .
II-B Nonparametric Estimation
Let us recall the basics about the problem of estimating a smooth density; more details may be found in [44, 41]. Let be an integer, and be a multi-index. For , let denote the differential operator
(1) |
Given , let be the maximum integer strictly smaller than [44] (note the difference with the usual conventions). Given a function whose domain includes a set , define as the minimum such that
(2) |
for all multi-indices such that . For example, define a Lipschitz function and an integer defines a function with bounded -th derivative.
Given , let be the class of probability density functions satisfying . Let be arbitrary. The following result on the minimax estimation error is well-known:
(3) |
where the infimum is over all estimators of , i.e., measurable maps from i.i.d. samples to . in (3) may hide constants independent of .
We say is a kernel of order () if and all up to the -th derivatives of the Fourier transform of vanish at [44, Definition 1.3]. Therefore the rectangular kernel, which is the indicator of a set, is order 1. A kernel estimator has the form
(4) |
where is called the bandwidth. If is a kernel of order , then the kernel estimator (4) with appropriate achieves the bound in (3) [44, Chapter 1]. In particular, the rectangular kernel is minimax optimal for .
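To make the kernel estimator (4) concrete, here is a minimal Python sketch of a rectangular-kernel estimate at a single point; the function name, the cube-shaped neighborhood, and the normalization by the cube volume are our own illustrative choices, since the displayed formulas are not fully recoverable here.

```python
import numpy as np

def rect_kernel_estimate(samples, x0, h):
    """Rectangular-kernel density estimate at the query point x0.

    samples : (n, d) array of i.i.d. observations
    x0      : (d,) query point
    h       : bandwidth (side length of the cube-shaped neighborhood)

    Counts the fraction of samples falling in the cube of side h centered
    at x0, then normalizes by the cube volume h**d.
    """
    samples = np.asarray(samples, dtype=float)
    inside = np.all(np.abs(samples - np.asarray(x0)) <= h / 2.0, axis=1)
    return inside.mean() / h ** samples.shape[1]

# Example: the standard bivariate normal density at the origin is
# 1 / (2 * pi), roughly 0.159.
rng = np.random.default_rng(0)
x = rng.normal(size=(200_000, 2))
print(rect_kernel_estimate(x, [0.0, 0.0], h=0.2))
```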
If is compactly supported, then only local smoothness is needed, and a density lower bound does not change the rate: we have
(5) |
where is any compact neighborhood of , is arbitrary (with denoting the volume of ), and denotes the non-empty set of probability density functions satisfying and .
II-C Strong and Symmetric Data Processing Constants
The strong data processing constant has proved useful in many distributed estimation problems [10, 4, 16, 52]. In particular, it is strongly connected to two-party hypothesis testing under the one-way protocol. In contrast, the symmetric data processing constant [32] can be viewed as a natural extension to interactive protocols. This section recalls their definitions and auxiliary results, which will mainly be used in the proofs of lower bounds; however, the intuitions are useful for the upper bounds as well.
Given two probability measures , on the same measurable space, define the KL divergence
(6) |
Define the -divergence
(7) |
Let be two random variables with joint distribution . Define the mutual information
(8) |
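As a small numerical companion to these definitions, the sketch below computes the KL and chi-squared divergences for discrete distributions and checks the domination of KL by chi-squared, a fact invoked later in the proof of Theorem 5; the function names are ours.

```python
import numpy as np

def kl_divergence(p, q):
    """KL divergence D(P || Q) between discrete pmfs, natural logarithm.
    Assumes P is absolutely continuous with respect to Q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def chi2_divergence(p, q):
    """Chi-squared divergence between discrete pmfs with q > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum((p - q) ** 2 / q))

# KL is dominated by chi-squared:
# D(P||Q) <= log(1 + chi2(P||Q)) <= chi2(P||Q).
p, q = [0.2, 0.8], [0.5, 0.5]
print(kl_divergence(p, q), chi2_divergence(p, q))  # ~0.193 vs 0.36
```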
Definition 1.
Let be an arbitrary distribution on . Define the strong data processing constant
(9) |
where is a conditional distribution (with being an arbitrary set), and the mutual informations are computed under the joint distribution .
Clearly, the value of does not depend on the choice of the base of logarithm. A basic yet useful property of the strong data processing constant is tensorization: if then
(10) |
Now if are the samples observed by Alice and Bob and denotes the message sent to Bob, then implies that
(11) |
The left side is the KL divergence between the distribution under the hypothesis that follows some joint distribution, and the distribution under the hypothesis that and are independent. Thus the error probabilities in testing against independence with one-way protocols can be lower bounded. This simple argument dates back at least to [4, 3].
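For intuition about Definition 1, the following sketch numerically searches over binary auxiliary channels to lower bound the strong data processing constant of a 2x2 joint distribution. The helper names are ours, and restricting the auxiliary variable to be binary with a finite grid yields only a numerical lower bound; the doubly symmetric example at the end is also our own.

```python
import numpy as np

def mutual_information(P):
    """Mutual information (natural log) of a joint pmf matrix P."""
    P = np.asarray(P, float)
    prod = np.outer(P.sum(axis=1), P.sum(axis=0))
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / prod[mask])))

def sdpi_lower_bound(P, grid=101):
    """Grid-search lower bound on sup I(U;Y)/I(U;X) over channels P_{U|X}
    with U - X - Y a Markov chain, for a 2x2 joint pmf P (rows: x, cols: y).
    Only binary U is searched, so this is a lower bound in general."""
    P = np.asarray(P, float)
    px = P.sum(axis=1)
    best = 0.0
    ts = np.linspace(0.0, 1.0, grid)
    for a in ts:                  # P(U = 1 | X = 0)
        for b in ts:              # P(U = 1 | X = 1)
            W = np.array([[1 - a, a], [1 - b, b]])   # rows x, columns u
            PXU = px[:, None] * W                    # joint pmf of (X, U)
            PUY = W.T @ P                            # joint pmf of (U, Y)
            iux = mutual_information(PXU)
            if iux > 1e-12:
                best = max(best, mutual_information(PUY) / iux)
    return best

# Doubly symmetric binary source with crossover 0.1: the strong data
# processing constant is classically known to be (1 - 2*0.1)**2 = 0.64,
# and the grid search gets close to (but stays below) that value.
P = np.array([[0.45, 0.05], [0.05, 0.45]])
print(sdpi_lower_bound(P))
```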
A similar argument can be extended to testing independence under interactive protocols [48]. The fundamental fact enabling such extensions is the tensorization of certain information-theoretic quantities, which appeared in various contexts [25, 9, 32].
Definition 2.
Let . For given , define as the supremum of such that there exist random variables satisfying
(12) | |||
(13) |
and
(14) | |||
(15) |
are Markov chains. We call the symmetric data processing constant.
Let us remark that using the Markov chains we have the right side of (12)
(16) | |||
(17) |
whereas the right side of (13)
(18) |
In the computer science literature [9], is called the external information whereas the internal information.
The symmetric strong data processing constant is symmetric in the sense that , since in the definition. On the other hand, coincides with the strong data processing constant which is generally not symmetric. Furthermore, a tensorization property holds for the internal and external information: denote by the set of all satisfying (12) and (13) for some . Let . Then
(19) |
In particular, taking the slope of the boundary at the origin yields
(20) |
A useful and general upper bound on in terms of SVD was provided in [32, Theorem 4], which implies that when and are unbiased Bernoulli. However, that bound is not tight enough for the nonparametric estimation problem we consider, and in fact we adopt a new approach in Section IX for the biased Bernoulli distribution. Let us remark that holds also for Gaussian , which follows by combining the result on unbiased Bernoulli distribution and a central limit theorem argument [32] (see also [19]). Moreover, it was conjectured in [32] that the set of possible satisfying (12)-(13) does not depend on when and are unbiased Bernoulli.
II-D Testing Against Independence
Consider the following setting: is an arbitrary distribution on ; ; is a sequence of random variables, with being given and satisfying , for and for ; is the distribution under the hypothesis of independence, and . The following result is known in [48, 20, 19]:
Lemma 1.
.
III Problem Setup and Main Results
We consider estimating the density function at a given point, where the density is assumed to be Hölder continuous in a neighborhood of that point. It is clear that there is no loss of generality in assuming such a neighborhood to be the unit cube, and that the given point is its center. More precisely, the class of densities under consideration is defined as follows:
Definition 3.
Given , , , and , let be the set of all probability densities on (where ) satisfying
(23) |
and
(24) |
Definition 4.
We say is a prefix code [14] if it is a subset of the set of all finite non-empty binary sequences satisfying the property that for any distinct , cannot be a prefix of .
The problem is to estimate the density at a given point of an unknown distribution from . More precisely,
• is a fixed but unknown distribution whose corresponding density belongs to for some , , and .
• An infinite sequence of pairs , , … are i.i.d. according to . Alice (Terminal 1) observes and Bob (Terminal 2) observes .
• Unlimited common randomness is observed by both Alice and Bob. That is, an infinite random bit string independent of is shared by Alice and Bob.
• For ( is an integer), if is odd, then Alice sends to Bob a message , which is an element in a prefix code, where is computed using the common randomness , the previous transcripts , and ; if is even, then Bob sends to Alice a message computed using , , and .
• Bob computes an estimate of the true density , where is the center of .
One-way NP Estimation Problem. Suppose that . Under the constraint on the expected length of the transcript (i.e. length of the bit string)
(25) |
where is a real number, what is the minimax risk
(26) |
Interactive NP Estimation Problem. Under the same constraint on the expected length of the transcript, but without any constraint on the number of rounds , what is the minimax risk?
Remark 1.
The prefix condition ensures that Bob knows that the current round has terminated after finishing reading each . Alternatively, the problem can be formulated by stating that is a random variable in an arbitrary alphabet, and replacing (25) by the entropy constraint . Furthermore, one may use the information leakage constraint instead. From our proofs it is clear that the minimax rates will not change under these alternative formulations.
Remark 2.
There would be no essential difference if the problem were formulated with almost surely and with probability (say) at least . Indeed, for the upper bound direction, those conditions are satisfied with a truncation argument, once we have an algorithm satisfying and , by Markov’s inequality and the union bound; therefore the results differ only by a constant factor. For the lower bound, the proof can be extended to the high-probability version, since we used a Le Cam-style argument [51].
Remark 3.
The common randomness assumption is common in the communication complexity literature, and, in some sense, is equivalent to private randomness [34]. In our upper bound proof, the common randomness is the randomness in the codebooks. Random codebooks give rise to convenient properties, such as the fact that the expectation of the distribution of the matched codewords equals exactly the product of idealized single-letter distributions (82). It is likely, however, that some approximate versions of these proof steps, and ultimately the same asymptotic risk, should hold for some carefully designed deterministic codebooks.
Theorem 1.
In one-way NP estimation, for any , , and ,
(27) |
where hides multiplicative factors depending on , , and .
The proof of the upper bound is in Section VII-B. Recall that nonparametric density estimation using a rectangular kernel is equivalent to counting the frequency of samples in a neighborhood of a given diameter, the bandwidth, which we denote as . A naive protocol is for Alice to send the indices of samples in . Locating each sample in that neighborhood requires on average bits. Thus samples in that neighborhood can be located. It turns out that the naive protocol is asymptotically optimal.
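The naive one-way protocol just described can be sketched as follows; the function names, the cube-shaped neighborhoods, and the normalization of Bob's estimate are our own illustrative choices rather than the paper's notation.

```python
import numpy as np

def alice_message(x_samples, x0, h, n):
    """Alice's one-way message: the indices, among her first n samples, of
    those X_i falling in the cube of side h centered at x0.  Each index
    costs roughly log2(n) bits, matching the count sketched in the text."""
    x = np.asarray(x_samples[:n], dtype=float)
    inside = np.all(np.abs(x - np.asarray(x0)) <= h / 2.0, axis=1)
    return np.flatnonzero(inside)

def bob_estimate(indices, y_samples, y0, h, n):
    """Bob keeps the reported indices whose Y_i also falls in his cube, then
    forms a rectangular-kernel estimate of the joint density at (x0, y0),
    using bandwidth h in each of the two subspaces."""
    y = np.asarray(y_samples, dtype=float)
    d = y.shape[1]
    hits = sum(1 for i in indices
               if np.all(np.abs(y[i] - np.asarray(y0)) <= h / 2.0))
    return hits / (n * h ** (2 * d))

# Example with independent standard normal X and Y (each 1-dimensional);
# the joint density at (0, 0) is about 0.159.
rng = np.random.default_rng(1)
n = 50_000
xy = rng.normal(size=(n, 2))
msg = alice_message(xy[:, :1], [0.0], h=0.3, n=n)
print(len(msg), bob_estimate(msg, xy[:, 1:], [0.0], h=0.3, n=n))
```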
The proof of the lower bound (Section VIII) follows by a reduction to testing independence for biased Bernoulli distributions, via a simulation argument. Although some arguments are similar to [19], the present problem concerns biased Bernoulli distributions instead. The (KL) strong data processing constant turns out to be drastically different from the data processing constant, as opposed to the cases of many familiar distributions such as the unbiased Bernoulli or the Gaussian distributions.
As alluded, our main result is that the risk can be strictly improved when interactions are allowed:
Theorem 2.
In interactive NP estimation, for any , , and , we have
(28) |
where hides multiplicative factors depending on , and .
To achieve the scaling in (28), can grow as slowly as the super-logarithm (i.e., inverse tetration) of ; for the precise relation between and , see Section V-C.
The proof of the upper bound of Theorem 2 is given in Section VII-C, which is based on a novel multi-round estimation scheme for biased Bernoulli distributions formulated and analyzed in Sections IV,V,VI. Roughly speaking, the intuition is to “locate” the samples within neighborhoods of by successive refinements, which is more communication-efficient than revealing the location at once.
IV Estimation of Biased Bernoulli Distributions
In this section, we shall describe an algorithm for estimating the joint distribution of a pair of biased Bernoulli random variables. The biased Bernoulli estimation problem can be viewed as a natural generalization of the Gap-Hamming problem [24][12] to the sparse setting, and is the key component in both the upper and lower bound analyses for the nonparametric estimation problem. Indeed, we shall explain in Section VII that our nonparametric estimator is based on a linear combination of rectangular kernel estimators, which estimate the probability that and fall into neighborhoods of and . Indicators that samples are within such neighborhoods are Bernoulli variables, so that the biased Bernoulli estimator can be used. For the lower bound, we shall explain in Section VIII that the nonparametric estimation problem can be reduced to the biased Bernoulli estimation problem via a simulation argument.
For notational simplicity, we shall use for the Bernoulli variables in this section as well as Sections V-VI, although we should keep in mind that these are not the continuous variables in the original nonparametric estimation problem.
Bernoulli Estimation Problem:
• Fixed real numbers , and an unknown .
• i.i.d. according to the distribution
(31) where we recall our convention that the upper left entry of the matrix denotes the probability that . Alice observes and Bob observes .
• Unlimited common randomness .
Goal: Alice and Bob exchange messages in no more than rounds in order to estimate .
Our algorithm is described as follows:
Input. ; positive integer and ; a sequence of real numbers satisfying
(32) | |||
(33) |
The can be viewed as parameters of the algorithm, and controls how much information is revealed about the locations of “common zeros” of in each round of communication. For example, setting and all other yields a one-way communication protocol, whereas setting all yields a “successive refinement” algorithm which may incur a smaller communication budget yet convey the same amount of information.
Before describing the algorithm, let us define a conditional distribution by recursion, which will be used later in generating random codebooks.
Definition 5.
For each , define
(34) | ||||
(35) | ||||
(36) |
Then set . For even, we use similar definitions, but with the roles of and switched. This specifies , .
Note that by Definition 5, implies for each . In words, for odd, marks all as , and marks as either or ; whenever is marked, then is definitely 1, and will be forgotten in all subsequent rounds. Now set
(37) |
where is induced by in Definition 5.
Initialization. By applying a common function to the common randomness, Alice and Bob can produce a shared infinite array , where , , , such that the entries in the array are independent random variables, with . Also set
(38) |
Iterations. Consider any , where is odd. We want to generate by selecting a codeword so that follows the distribution of , where is defined in previous rounds. Define
(39) | ||||
(40) | ||||
(41) |
Note that Alice knows both and , while Bob knows , since it will be seen from the recursion that Alice and Bob both know at the beginning of the -th round. Alice chooses as the minimum nonnegative integer such that
(42) |
Alice encodes using a prefix code, e.g. Elias gamma code [17], and sends it to Bob. Then both Alice and Bob compute by
(43) | ||||
(44) |
The operations in the -th round for even are similar, with the roles of Alice and Bob reversed. We will see later that the notation is consistent in the sense of (53).
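Since the prefix code is only required to encode a nonnegative integer index, any universal integer code works; below is a minimal sketch of the Elias gamma code mentioned above, where shifting the index by one before encoding is our convention (gamma codewords start at 1).

```python
def elias_gamma_encode(n):
    """Elias gamma codeword for an integer n >= 1: floor(log2 n) zeros
    followed by the binary expansion of n; uses 2*floor(log2 n) + 1 bits,
    and no codeword is a prefix of another."""
    assert n >= 1
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def elias_gamma_decode(bits):
    """Decode one gamma codeword from the front of a bit string; returns
    (value, number_of_bits_consumed)."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    return int(bits[zeros:2 * zeros + 1], 2), 2 * zeros + 1

# A nonnegative index K can be transmitted as elias_gamma_encode(K + 1).
print(elias_gamma_encode(1), elias_gamma_encode(9))   # '1' '0001001'
print(elias_gamma_decode("0001001"))                  # (9, 7)
```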
Estimator. Recall that in classical parametric statistics, one can evaluate the score function at the sample, compute its expectation and variance, and construct an estimator achieving the Cramér-Rao bound asymptotically. Now for , define the score function
(45) | ||||
(48) |
where is induced by . For , define similarly with the roles of and reversed. Alice and Bob can each compute
(49) |
and
(50) |
respectively. Finally, Alice’s and Bob’s estimators are given by
(51) | ||||
(52) |
where refers to expectation when the true parameter is , and denotes the derivative in . We will show that these estimators are well-defined: and are independent of (Lemma 3), and can be computed by Alice and Bob without knowing .
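As a reminder of the classical construction alluded to above, here is a one-step, score-based estimator for a plain Bernoulli parameter; it is only an analogy for the role played by (49)-(52), not the paper's estimator, and all names in it are ours.

```python
import numpy as np

def one_step_estimate(samples, theta0, score, fisher_info):
    """One-step correction of a reference value theta0 by the empirical
    mean of the score evaluated at theta0, divided by the Fisher
    information at theta0.  For the Bernoulli example below this
    correction is exactly unbiased for any reference point theta0."""
    s = np.mean([score(x, theta0) for x in samples])
    return theta0 + s / fisher_info

# Bernoulli(theta): score(x; t) = x/t - (1-x)/(1-t), I(t) = 1/(t*(1-t)).
rng = np.random.default_rng(1)
theta_true, theta0 = 0.3, 0.25
x = (rng.random(100_000) < theta_true).astype(float)
score = lambda x, t: x / t - (1 - x) / (1 - t)
print(one_step_estimate(x, theta0, score, 1.0 / (theta0 * (1 - theta0))))
```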
Lemma 2.
Lemma 3.
and are linear in .
Proof.
By (53),
(54) | |||
(55) |
for each odd and , and similar expressions hold for even. The claims then follow. ∎
Theorem 3.
and are unbiased estimators.
V Bounds on Information Exchanges
In this section we prove key auxiliary results that will be used in the upper bounds.
V-A General
The following theorem is crucial for the achievability part of the analysis of the Bernoulli estimation problem described in Section IV (and hence for the nonparametric estimation problem). Specifically, (56)-(57) bound the communication from Alice to Bob and in reverse, and (58)-(59) bound the information exchanged, which, in turn, will bound the estimation risk via Fano’s inequality.
Theorem 4.
The proof can be found in Appendix A.
Remark 4.
Remark 5.
Let us provide some intuition why interaction helps, assuming the case of for simplicity. From the proof of Theorem 4, it can be seen that up to a constant factor, is at least (which is in fact sharp as will be seen from the upper bound on in Theorem 6). Moreover, lower bounds on can be computed by replacing the integrals with discrete sums with terms:
(65) |
where . In particular, when , we recover , whereas choosing , shows that achieves . Even better, later we will take as the -th iterated power of , and then will be the super logarithm of .
Recall that control the amount of information revealed in each round and serve as hyperparameters of the algorithm to be tuned. Next we shall explain how to select the value of so that the optimal performance is achieved in the one-way and interactive settings.
V-B Case
Specializing Theorem 4 we obtain:
Corollary 4.
For any , with and we have
(66) | ||||
(67) |
V-C Case
Denote by the -th tetration of 2, which is defined recursively by and
(68) |
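In code, with the convention that the first tetration of 2 equals 2 (our reading of the recursion, since the displayed base case is not recoverable here), tetration and its inverse look as follows.

```python
def tetration(k, base=2):
    """k-th tetration of `base`: a power tower of height k.
    tetration(1) = 2, tetration(2) = 4, tetration(3) = 16, tetration(4) = 65536."""
    t = 1
    for _ in range(k):
        t = base ** t
    return t

def super_log(n, base=2):
    """Super-logarithm (inverse tetration): the least k with tetration(k) >= n."""
    k, t = 0, 1
    while t < n:
        k, t = k + 1, base ** t
    return k

print([tetration(k) for k in range(1, 5)])   # [2, 4, 16, 65536]
# Five rounds already cover astronomically large sample/communication sizes:
print(super_log(10 ** 80))                   # 5
```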
Let , and let be the minimum integer such that
(69) |
For we have . Then we set
(70) | ||||
(71) | ||||
(72) |
which fulfills . We see that
(73) | |||
(74) | |||
(75) | |||
(76) |
The first inequality above follows by . Similarly we also have . Moreover,
(77) |
Summarizing, we have
Corollary 5.
Let us remark that the sequence we used in (70)-(72) is essentially optimal: Let . In order for (73) to converge, we need to be convergent. Therefore cannot grow faster than , which is tetration. However, this only amounts to a lower bound on for a particular design of in Definition 5. Since tetration grows extremely fast, from a practical viewpoint is essentially a constant. Nevertheless, it remains an interesting theoretical question whether needs to diverge:
Conjecture 1.
If there is an algorithm (indexed by ) achieving the optimal risk (28) for nonparametric estimation, then necessarily as .
VI Achievability Bounds for Bernoulli Estimation
In this section we analyze the performance of the Bernoulli distribution estimation algorithm described in Section IV.
VI-A Communication Complexity
Consider any . Denoting by the empirical distribution of , we have from (53) that
(82) |
In particular,
(83) |
Let be the number of bits needed to encode the positive integer using the Elias gamma code [17]. For each we have
(84) | ||||
(85) | ||||
(86) | ||||
(87) |
where (86) follows from the selection rule (42) and the formula for the expectation of the geometric distribution. Then
(88) | ||||
(89) |
where (88) used (83); (89) used the fact that is dominated by . Note that in (88)-(89) denotes the value of the vector .
VI-B Expectation of
Recall that was defined in (48). Pick arbitrary . Since
(90) |
and since and are independent of , we obtain
(91) |
Next, observe that for any ,
(92) | |||
(93) | |||
(94) |
Moreover, for any ,
(95) | |||
(96) | |||
(97) | |||
(98) |
where (95) used (83); (96) used (94) and the linearity of in ; (97) used (91); (98) follows from (94). Thus
(99) |
where we defined
(100) |
Lemma 6.
Let . We have
(101) |
where the mutual information is computed with the natural logarithm.
VI-C Variance of
VI-D Case
We now prove achievability bounds for the Bernoulli distribution estimation algorithm.
Corollary 7.
Given , for and , the mean square error and total communication cost .
VI-E Case
Corollary 8.
Let , . For , defined in Section V-C, the mean square error and total communication cost .
VII Density Estimation Upper Bounds
In this section we prove the upper bounds in Theorem 1 and Theorem 2, by building nonparametric density estimators based on the Bernoulli distribution estimator. For , the rectangular kernel is minimax optimal (Section II-B), so that the integral with the kernel can be directly estimated using the Bernoulli distribution estimator, which we explain in Sections VII-B and VII-C. The extension to higher-order kernels is possible using a linear combination of rectangular kernels, which is explained in Section VII-D.
VII-A Density Lower Bound Assumption
First, we observe the following simple argument showing that it suffices to consider . Define
(127) |
where the supremum is over all densities on satisfying . Clearly is finite and depends only on .
Lemma 9.
In either the one-way or the interactive setting, suppose that there exists an algorithm achieving for some and . Then, there must be an algorithm achieving .
Proof.
Pick one such that and . Since , such exists. Consider an arbitrary , and suppose infinite pairs i.i.d. according to are available to Alice and Bob. Using the common randomness, Alice and Bob can simulate i.i.d. pairs according to , by replacing each pair with probability with a new pair drawn according to . Clearly , and by assumption, can be estimated with mean square risk . Since is known, this implies that can be estimated with mean square risk . ∎
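The per-sample replacement in this proof can be sketched as follows; it assumes, purely for illustration, that the known mixture component factorizes across the two terminals (otherwise the shared random string can also supply correlated replacement draws), and the function name is ours.

```python
import numpy as np

def simulate_mixture_coordinate(own_samples, sample_own_coord, alpha,
                                shared_seed, terminal_id):
    """One terminal's view of the simulation step: with probability alpha,
    independently for each index, its own coordinate of the pair is replaced
    by a fresh draw from the known component.  The replacement decisions are
    driven by the shared seed, so Alice and Bob replace exactly the same
    indices; the replacement draws use a terminal-specific stream, so the new
    pairs follow the product component.  The resulting pairs are then i.i.d.
    from (1 - alpha) * P + alpha * Q."""
    decision_rng = np.random.default_rng(shared_seed)
    draw_rng = np.random.default_rng([shared_seed, terminal_id])
    out = np.array(own_samples, dtype=float, copy=True)
    for i in np.flatnonzero(decision_rng.random(len(out)) < alpha):
        out[i] = sample_own_coord(draw_rng)
    return out

# Example: each terminal mixes 10% of its coordinates with uniform draws.
alice_x = np.random.default_rng(3).random((1000, 2))
alice_new = simulate_mixture_coordinate(alice_x, lambda g: g.random(2),
                                        alpha=0.1, shared_seed=42,
                                        terminal_id=0)
```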
For the rest of the section, we will assume that there is a density lower bound and , which is sufficient in view of Lemma 9. Consider bandwidth (which will be specified later as an inverse polynomial of ). Also introduce the notations
(128) | ||||
(129) |
Define as the probability distribution induced by and with
(130) | ||||
(131) |
Define
(132) | ||||
(133) |
Note that is bounded above and below by positive constants depending on , , and (see (138) and (143)). Also, we can assume Alice and Bob both know and , since with infinite samples Alice and Bob know their marginal densities and , and Alice can send to Bob with very high precision using a negligible number of bits. Let be the number such that is the matrix
(136) |
Let be Bob’s estimator of in (52). Then we define Bob’s density estimator:
(137) |
We next show that the smoothness of the density ensures that is at most the order of a constant. Recall that is a density lower bound on and . Define and . The definition of implies , and hence
(138) |
Recall defined in (127). We then have
(139) | ||||
(140) | ||||
(141) | ||||
(142) |
Next, observe that which yields . Similarly we also have , therefore
(143) |
Together with (138), we see that .
Next, the bias of the density estimator is
(144) |
which is just the bias of the rectangular kernel estimator (with bandwidth in each of the two subspaces). The rectangular kernel is of order [44, Definition 1.3] and compactly supported, while by assumption ; therefore the bias is bounded by ([44, Proposition 1.2])
(145) |
where is a constant depending only on and .
VII-B One-Way Case
By Corollary 7 and (143), we can bound the variance of the density estimator as
(146) | ||||
(147) | ||||
(148) |
where (148) used (143). Also by Corollary 7, the communication constraint is satisfied if the following holds
(149) |
Now we can choose so that as defined by (132), and set
(150) |
where , which is defined as the right side of (142) and hence depends only on , is an upper estimate of the true parameter . Then the communication constraint is satisfied. Moreover by the bias (145) and the variance (148) bounds, the risk is bounded by
(151) | |||
(152) | |||
(153) |
where is a constant depending only on , , and , and we used the fact that is bounded above by (143) and the bound on shown in (138). This proves the upper bound in Theorem 1 for .
VII-C Interactive Case
Choose such that as defined by and (132)-(133) satisfies
(154) |
and set
(155) |
where as before is an upper bound on and only depends on . By Corollary 8, for large enough we have , and the communication cost is bounded by . Moreover from (152), the risk is bounded by for some depending only on , , and . This proves the upper bound in Theorem 2 when .
VII-D Extension to
For , the rectangular kernel is no longer minimax optimal. However, observe the following:
Proposition 10.
For any positive integers and , there exists an order -kernel in which is a linear combination of indicator functions.
Proof.
In the following we prove for ; the case of general will then follow by taking the tensor product of kernel functions in . Note that an -th order kernel must satisfy the following equations:
(156) | ||||
(157) |
Let us consider of the following form:
(158) |
where . Since such is an even function, (156)-(157) yield nontrivial equations for (i.e., only when is even):
(159) | ||||
(160) |
From the formula for the determinant of the Vandermonde matrix, we see that these equations have a unique solution for . ∎
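A numerical sketch of this construction in one dimension: with nested cutoffs (our choice, since the exact form in (158) is not recoverable from the extracted text), the moment conditions become a Vandermonde-type system that can be solved directly. Tensoring such one-dimensional kernels gives the multivariate version in the proposition.

```python
import numpy as np

def indicator_kernel(order):
    """Weights a_j and cutoffs c_j such that K(t) = sum_j a_j * 1{|t| <= c_j}
    integrates to 1 and has vanishing moments t^2, t^4, ..., t^(2*(order-1));
    odd moments vanish automatically since K is even.  The system matrix is
    Vandermonde in c_j^2, hence nonsingular for distinct cutoffs."""
    c = np.arange(1, order + 1) / order
    # integral of t^(2i) over [-c_j, c_j] equals 2 * c_j^(2i+1) / (2i+1)
    M = np.array([[2.0 * cj ** (2 * i + 1) / (2 * i + 1) for cj in c]
                  for i in range(order)])
    rhs = np.zeros(order)
    rhs[0] = 1.0
    return np.linalg.solve(M, rhs), c

a, c = indicator_kernel(2)
print(a, c)   # roughly [1.333, -0.167] on the cutoffs [0.5, 1.0]
```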
VIII One-way Density Estimation Lower Bound
VIII-A Upper Bounding
The pointwise estimation lower bound is obtained by lower bounding the risk for estimating (with and being indicators of neighborhoods of and ), and applying Le Cam’s inequality to the latter. Therefore we are led to considering the strong data processing constant for the biased Bernoulli distribution.
Theorem 5.
Let be as defined in (31), where and . Then .
Proof.
For this proof we can assume without loss of generality that the logarithms are natural. For any , let be the output through the channel . Then
(161) | ||||
(162) | ||||
(163) |
where the divergence was defined in (7). The justification of the steps is as follows: (161) is well-known. (162) follows since the divergence dominates the KL divergence (see e.g. [39]). To see (163), note that is upper bounded by , the square of the maximal correlation coefficient (see e.g. [5, 6]). As the operator norm of a linear operator, can be shown to equal the second largest singular value of
(164) | ||||
(165) |
see e.g. [6]. Since is a symmetric matrix, its singular values are its eigenvalues. The largest eigenvalue of is 1, corresponding to the eigenvector (which is evident from (164)), whereas the trace
(166) |
which is evident from (165). Therefore . Moreover, since and divergences are both -divergences, their ratio can be bounded by the ratio of their corresponding -functions (see e.g. [39]):
(167) | ||||
(168) |
The constraint in (167) is because and . To show (168), it suffices to show that
is achieved at . For this, it suffices to show that the derivative of the objective function, is negative on . Indeed, define . We can check that , , and , which imply that for and for . Therefore (168), and hence (163), holds. ∎
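The singular-value computation in this proof is easy to reproduce numerically; the sketch below computes the squared maximal correlation of a 2x2 joint pmf as the second largest singular value squared, and the particular biased example at the end is our own (the entries of (31) are not recoverable from the extracted text).

```python
import numpy as np

def maximal_correlation_sq(P):
    """Square of the maximal correlation of a finite joint pmf P (rows: x,
    columns: y): the second largest singular value, squared, of
    diag(P_X)^(-1/2) P diag(P_Y)^(-1/2); the largest singular value is 1."""
    P = np.asarray(P, dtype=float)
    Q = P / np.sqrt(np.outer(P.sum(axis=1), P.sum(axis=0)))
    return float(np.linalg.svd(Q, compute_uv=False)[1] ** 2)

# A biased Bernoulli pair: both marginals put mass eps on the symbol 1,
# with a small excess mass delta on the (1, 1) cell.
eps, delta = 0.01, 0.002
P = np.array([[1 - 2 * eps + eps ** 2 + delta, eps - eps ** 2 - delta],
              [eps - eps ** 2 - delta, eps ** 2 + delta]])
print(maximal_correlation_sq(P))
```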
VIII-B Lower Bounding One-Way NP Estimation Risk
Given , consider a distribution on with matrix
(171) |
where and , with being a constant. We then need to “simulate” smooth distributions from . Let be a function satisfying the following properties:
• has a compact support;
• ;
• ;
• , for all ;
• ;
• Define as a function on . Then .
Clearly, such a function exists for any given . For sufficiently large such that and (recall that is the given point in the density estimation problem), define
(172) | |||
(173) |
Since , clearly the above define valid probability densities supported on . Define
(174) | |||
(175) |
which are also probability densities supported on . Define , where and are conditional distributions defined by the densities above. Under the joint distribution , we have
(176) |
Define
(177) |
We now check that the density of satisfies for sufficiently large. Indeed, for ,
(178) | |||
(179) | |||
(180) | |||
(181) | |||
(182) |
By the assumptions on , we see that
(183) | ||||
(184) | ||||
(185) |
Therefore with the choice , we have for .
Now we can apply a Le Cam style argument [51] for the estimation lower bound. Suppose that there exists an algorithm that estimates the density at as . Alice and Bob can convert this to an algorithm for testing the binary distributions against . Indeed, suppose that is an infinite sequence of i.i.d. random variable pairs according to either or . Using the local randomness (which is implied by the common randomness), Alice and Bob can simulate the sequence of i.i.d. random variables according to either or , by applying the random transformations and coordinate-wise. Then Alice and Bob can apply the density estimation algorithm to obtain . Note that while
(186) |
the latter following from (181). Now suppose that Bob declares if
(187) |
and otherwise. By Chebyshev’s inequality, the error probability (under either hypothesis) is upper bounded by
(188) |
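In code form, the estimator-to-test reduction used here is a plain threshold rule with a Chebyshev error bound; the function and its argument names below are our own generic sketch.

```python
def test_from_estimator(estimate, target0, target1, mse_bound):
    """Convert a point estimator into a test between two hypotheses whose
    target values satisfy target0 < target1: declare hypothesis 0 iff the
    estimate lies below the midpoint.  If the estimator's mean squared
    error around the true target is at most mse_bound, Chebyshev's
    inequality bounds the error probability under either hypothesis by
    mse_bound / ((target1 - target0) / 2) ** 2."""
    threshold = (target0 + target1) / 2.0
    decision = 0 if estimate < threshold else 1
    return decision, mse_bound / ((target1 - target0) / 2.0) ** 2
```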
On the other hand, from (22) and Theorem 5 we have
(189) | ||||
(190) | ||||
(191) | ||||
(192) |
when is sufficiently large. However, it is known (from Kraft’s inequality, see e.g. [14]) that the expected length of a prefix code upper bounds the entropy. Thus
(193) |
and therefore
(194) |
Then by Pinsker’s inequality (e.g. [44]),
(195) | ||||
(196) | ||||
(197) |
where the last line follows from our choice . However, lower bounds twice the quantity in (188). Therefore we have
(198) | |||
(199) |
which is lower bounded by for large enough . Since and depend only on , this establishes the lower bound in Theorem 1.
IX Interactive Density Estimation Lower Bound
In this section we prove the lower bound in Theorem 2.
IX-A Upper Bounding
The heart of the proof is the following technical result:
Theorem 6.
There exists small enough such that the following holds: For any which is a distribution on corresponding to the following matrix:
(202) |
where and we used the notation , we have
(203) |
The proof can be found in Appendix B.
IX-B Lower Bounding Interactive NP Estimation Risk
The proof is similar to the one-way case (Section VIII-B). Consider the distribution on as in (171). Let and , where with being the absolute constant in Theorem 6. Pick the function , and define and as before. Note that, as before, is uniform on , while for . has the same formula (186), and Alice and Bob can convert a (now interactive) density estimation algorithm to an algorithm for testing against . With the same testing rule (187), the error probability under either hypothesis is again upper bounded by (188).
Changes arise in (189), where we shall apply Theorem 6 instead. Note that for the absolute constant in Theorem 6, the condition is satisfied for sufficiently large (hence sufficiently large ).
(204) | ||||
(205) | ||||
(206) |
Again using Kraft’s inequality to bound , we obtain
(207) |
Then Pinsker’s inequality yields
(208) |
since we selected . Again lower bounds twice of (188), therefore
(209) | |||
(210) | |||
(211) |
where the last line holds for sufficiently large . Since is a universal constant and depends on only, this completes the proof of the interactive lower bound.
X Acknowledgement
The author would like to thank Professor Venkat Anantharam for bringing the reference [36] to his attention and for some interesting discussions. This research was supported by the starting grant from the Department of Statistics, University of Illinois, Urbana-Champaign.
Appendix A Proof of Theorem 4
First, assume that . By the definitions of and , we can verify that the following holds (for ):
(212) |
Indeed, (212) follows by applying induction on the following
(213) | ||||
(214) |
where , , and we used which in turn follows from the factorization
(215) | ||||
(216) |
Now from (212),
(217) | ||||
(218) |
Similarly, by switching the roles of and we have
(219) |
Therefore,
(220) | |||
(221) |
for any . We also see from (212) and (221) that for any odd,
(222) | ||||
(223) | ||||
(224) |
where the last step follows since . Therefore, the claim (56) follows, and (57) is similar.
Next, consider any . Define
(225) |
Observe that the construction of fulfills the Markov chain conditions (14)-(15), implying that
(226) |
We therefore have (in the following, we use the notation for )
(227) |
where we defined
(228) | ||||
(229) |
By continuity, we have
(230) | ||||
(231) |
as . It is also easy to see from (227) that (for this proof, only is the variable, and all other constants, such as and , can be hidden in the Landau notations). Therefore (227), (230), and (231) yield
(232) | |||
(233) |
Using the fact that and are independent under , noting (212) and the assumption , we have
(234) |
where is the largest odd integer not exceeding . Similarly we also have . Consequently, (233) yields
(235) | ||||
(236) |
Moreover, let us define
(237) |
In the following paragraph we consider arbitrary , and we shall omit the superscripts for , , , unless otherwise noted. Then
(238) |
We can verify that . Therefore,
(239) | ||||
(240) |
Therefore as ,
(241) | ||||
(242) | ||||
(243) | ||||
(244) |
where denotes the binary divergence function, and recall that we assumed the natural base of logarithms. On the other hand, . Therefore
(245) | ||||
(246) | ||||
(247) |
Turning back to (238), we obtain
(248) | |||
(249) | |||
(250) |
where (249) follows since (218) implies
(251) | ||||
(252) | ||||
(253) | ||||
(254) |
and
(255) | ||||
(256) | ||||
(257) | ||||
(258) |
Moreover, by (212),
(259) | ||||
(260) | ||||
(261) | ||||
(262) |
Similarly,
(263) |
(264) | ||||
(265) | ||||
(266) | ||||
(267) |
and hence
(268) |
establishing the claim (58) of the theorem. The proof of (59) is similar.
Appendix B Proof of Theorem 6
We can choose the natural base of logarithms in this proof. Choose satisfying the Markov chain conditions (14)-(15) and so that
(269) | ||||
(270) |
which is possible by the definition of .
Given , define by the unique distribution (alternatively, equals the -projection under the constraints and [15, Corollary 3.3]) such that
(271) |
for some functions and , and such that the marginals are and . For the existence of , see e.g. [23, 40]. Define as the mutual information of under . Define as the number such that is the matrix
(274) |
Given any , let be such that . Define similarly but for . With these notations, note that
(275) |
and
(276) | |||
(277) |
where we recall that denotes the binary divergence function. Define
(278) |
Then note that is a smooth nonnegative function on with vanishing value and first derivatives at . Also, define
(279) |
where . Then is also a smooth function on with vanishing value and first derivatives at . Moreover, due to (275), we have
(280) |
Thus to prove the theorem it suffices to show the existence of sufficiently small , such that for any , there is
(281) |
where the sup is over .
• Case 1: .
Since for , we have (282) for . Now if we can show that
(283) (284) we will obtain which matches (281). Here and below, means that there is an absolute constant such that when in the theorem statement (and hence and ) is sufficiently small.
Before explicitly computing , we give some intuition for why we should expect (284) to be true. For fixed , we will show that
(285) as , where we defined
(286) If the difference between and could be neglected, then (284) should hold. To see (285), for given , note that (271) implies,
(287) Under the assumption , the above linearizes to
(288) Moreover, note that
(289) (290) where the last step follows by comparing with (288). Since as , we see (285) holds. Of course, (285) does not really show (284) since approximation of function values generally does not imply approximation of the derivatives. However, we shall next explicitly take the derivatives to give a real proof, and the above observations are useful guides.
First, note that
(291) (292) where . Next, we express the first and second derivatives, , and , in terms of . Differentiating the logarithm of (287) in yields
(293) In the rest of the proof the notation means (recall the definition of in (284)), and if . Note that for we have
(294) since the right side of (287) clearly equals . Then by (293),
(295) and hence,
The expression for can be found similarly. Moreover, differentiating (293) we get
(297) from which we can deduce that
(298) Now, taking the derivative in (292), we obtain
(299) In deriving (299), we applied the Taylor expansions of and . Plugging (296) and (298) into (299), we obtain
(300) Next, we control . Similarly to (293), we have
(301) Further taking the derivative, we obtain
(302) Next, we shall use the assumption of to simplify (302) as
(303) Since, analogous to (296), we have
(304) we see that (303) implies
(305) Tighter estimates of are possible, but the above will suffice. We now take the derivative of (292) in :
(306) where
(307) and
(308) We can Taylor expand using the facts that , , to obtain
(309) We can Taylor expand using the fact that (analogous to (296)) to obtain
(310) Thus
(311) By symmetry, the same bound holds for as well. Together with (300), we have thus validated (284), and consequently (281), in this case.
• Case 2: .
Without loss of generality assume that and . From (292), we have (312). Using (304) with and , we obtain . Thus, Taylor expanding the display above, we obtain
(313) Then
(314) (315) (316) where we used the assumption that and the fact that . Now the assumption of implies
(317) To see the second inequality in (317), note that
(318) (319) where the minimization is easily solved by checking that the derivative is positive for . Finally, combining (317) with (316), we obtain , as desired.
• Case 3: , .
Assume without loss of generality that . In this case, using (313), we have (320) (321) (322) (323). On the other hand,
(324) (325) (326) (327) Thus we once again obtain , as desired.
References
- [1] Jayadev Acharya, Clement Canonne, Aditya Vikram Singh, and Himanshu Tyagi. Optimal rates for nonparametric density estimation under communication constraints. Advances in Neural Information Processing Systems, 34, 2021.
- [2] Jayadev Acharya, Clément L Canonne, and Himanshu Tyagi. Inference under information constraints II: Communication constraints and shared randomness. IEEE Transactions on Information Theory, 66(12):7856–7877, 2020.
- [3] Rudolf Ahlswede and Marat V Burnashev. On minimax estimation in the presence of side information about remote data. Annals of Statistics, pages 141–171, 1990.
- [4] Rudolf Ahlswede and Imre Csiszár. Hypothesis testing with communication constraints. IEEE Transactions on Information Theory, 32(4), 1986.
- [5] Rudolf Ahlswede and Peter Gács. Spreading of sets in product spaces and hypercontraction of the Markov operator. The Annals of Probability, 4:925–939, 1976.
- [6] Venkat Anantharam, Amin Gohari, Sudeep Kamath, and Chandra Nair. On maximal correlation, hypercontractivity, and the data processing inequality studied by Erkip and Cover. arXiv:1304.6133, 2013.
- [7] Leighton Pate Barnes, Yanjun Han, and Ayfer Özgür. Lower bounds for learning distributions under communication constraints via fisher information. The Journal of Machine Learning Research, 21(1):9583–9612, 2020.
- [8] Heather Battey, Jianqing Fan, Han Liu, Junwei Lu, and Ziwei Zhu. Distributed testing and estimation under sparse high dimensional models. Annals of Statistics, 46(3):1352, 2018.
- [9] M. Braverman and A. Rao. Information equals amortized communication. In FOCS, pages 748–757, 2011.
- [10] Mark Braverman, Ankit Garg, Tengyu Ma, Huy L Nguyen, and David P Woodruff. Communication lower bounds for statistical estimation problems via a distributed data processing inequality. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 1011–1020, 2016.
- [11] T. Tony Cai and Hongji Wei. Distributed Gaussian mean estimation under communication constraints: Optimal rates and communication-efficient algorithms. arXiv:2001.08877, 2020.
- [12] Amit Chakrabarti and Oded Regev. An optimal lower bound on the communication complexity of gap-Hamming distance. SIAM Journal on Computing, 41(5):1299–1317, 2012.
- [13] Amit Chakrabarti, Yaoyun Shi, Anthony Wirth, and Andrew Yao. Informational complexity and the direct sum problem for simultaneous message complexity. In Proceedings 42nd IEEE Symposium on Foundations of Computer Science, pages 270–278. IEEE, 2001.
- [14] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. 2nd Edition, Wiley, July 2006.
- [15] I. Csiszar. I-divergence geometry of probability distributions and minimization problems. Ann. Probab., 3(1):146–158, February, 1975.
- [16] John C Duchi, Michael I Jordan, and Martin J Wainwright. Local privacy and statistical minimax rates. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 429–438, 2013.
- [17] P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2):194–203, 1975.
- [18] Peter Kairouz et al. Advances and Open Problems in Federated Learning. Foundations and Trends in Machine Learning Series. 2021.
- [19] Uri Hadar, Jingbo Liu, Yury Polyanskiy, and Ofer Shayevitz. Communication complexity of estimating correlations. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 792–803, 2019.
- [20] Uri Hadar, Jingbo Liu, Yury Polyanskiy, and Ofer Shayevitz. Error exponents in distributed hypothesis testing of correlations. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 2674–2678, 2019.
- [21] Yanjun Han, Pritam Mukherjee, Ayfer Ozgur, and Tsachy Weissman. Distributed statistical estimation of high-dimensional and nonparametric distributions. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 506–510. IEEE, 2018.
- [22] Yanjun Han, Ayfer Özgür, and Tsachy Weissman. Geometric lower bounds for distributed parameter estimation under communication constraints. In Conference On Learning Theory, pages 3163–3188. PMLR, 2018.
- [23] Charles Hobby and Ronald Pyke. Doubly stochastic operators obtained from positive operators. Pacific Journal of Mathematics, 15(1):153–157, 1965.
- [24] Piotr Indyk and David Woodruff. Tight lower bounds for the distinct elements problem. In Proceedings of 44th Annual IEEE Symposium on Foundations of Computer Science, pages 283–288, 2003.
- [25] Amiram Kaspi. Two-way source coding with a fidelity criterion. IEEE Transactions on Information Theory, 31(6):735–740, 1985.
- [26] Eyal Kushilevitz and Noam Nisan. Communication Complexity. Cambridge University Press, 1997.
- [27] Jingbo Liu. Information Theory from A Functional Viewpoint. Ph.D. thesis, Princeton University, Princeton, NJ, 2018.
- [28] Jingbo Liu. Communication complexity of two-party nonparametric global density estimation. In 56th Annual Conference on Information Sciences and Systems (CISS), 2022.
- [29] Jingbo Liu. Interaction improves two-party nonparametric pointwise density estimation. In 2022 IEEE International Symposium on Information Theory (ISIT), pages 904–909, 2022.
- [30] Jingbo Liu, Thomas A Courtade, Paul Cuff, and Sergio Verdú. Smoothing Brascamp-Lieb inequalities and strong converses for common randomness generation. In 2016 IEEE International Symposium on Information Theory (ISIT), pages 1043–1047.
- [31] Jingbo Liu, Paul Cuff, and Sergio Verdú. Key generation with limited interaction. In 2016 IEEE International Symposium on Information Theory (ISIT), pages 2918–2922. IEEE, 2016.
- [32] Jingbo Liu, Paul Cuff, and Sergio Verdú. Secret key generation with limited interaction. IEEE Transactions on Information Theory, 63(11):7358–7381, 2017.
- [33] Nan Ma and Prakash Ishwar. Some results on distributed source coding for interactive function computation. IEEE Transactions on Information Theory, 57(9):6180–6195, 2011.
- [34] Ilan Newman. Private vs. common random bits in communication complexity. Information processing letters, 39(2):67–71, 1991.
- [35] Noam Nisan and Avi Widgerson. Rounds in communication complexity revisited. In Proceedings of the twenty-third annual ACM symposium on Theory of computing, pages 419–429, 1991.
- [36] Alon Orlitsky. Worst-case interactive communication. I. two messages are almost optimal. IEEE Transactions on Information Theory, 36(5):1111–1126, 1990.
- [37] Jeff M Phillips and Wai Ming Tai. Near-optimal coresets of kernel density estimates. Discrete & Computational Geometry, 63(4):867–887, 2020.
- [38] K. R. Sahasranand and Himanshu Tyagi. Extra samples can reduce the communication for independence testing. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 2316–2320, 2018.
- [39] Igal Sason and Sergio Verdú. -divergence inequalities. IEEE Transactions on Information Theory, 62(11):5973–6006, 2016.
- [40] Richard Sinkhorn. Diagonal equivalence to matrices with prescribed row and column sums. The American Mathematical Monthly, 74(4):402–405, 1967.
- [41] Charles J Stone. Optimal rates of convergence for nonparametric estimators. Annals of Statistics, pages 1348–1360, 1980.
- [42] Madhu Sudan, Badih Ghazi, Noah Golowich, and Mitali Bafna. Communication-rounds tradeoffs for common randomness and secret key generation. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1861–1871. SIAM, 2019.
- [43] Madhu Sudan, Himanshu Tyagi, and Shun Watanabe. Communication for generating correlation: A unifying survey. IEEE Transactions on Information Theory, 66(1):5–37, 2019.
- [44] Alexandre B Tsybakov. Introduction to nonparametric estimation. Springer Science & Business Media, 2008.
- [45] Paxton Turner, Jingbo Liu, and Philippe Rigollet. Efficient interpolation of density estimators. In International Conference on Artificial Intelligence and Statistics, pages 2503–2511. PMLR, 2021.
- [46] Paxton Turner, Jingbo Liu, and Philippe Rigollet. A statistical perspective on coreset density estimation. In International Conference on Artificial Intelligence and Statistics, pages 2512–2520. PMLR, 2021.
- [47] Himanshu Tyagi. Common information and secret key capacity. IEEE Transactions on Information Theory, 59(9):5627–5640, 2013.
- [48] Yu Xiang and Young-Han Kim. Interactive hypothesis testing against independence. In 2013 IEEE International Symposium on Information Theory, pages 2840–2844, 2013.
- [49] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):1–19, 2019.
- [50] Andrew Chi-Chih Yao. Some complexity questions related to distributive computing (preliminary report). In Proceedings of the eleventh annual ACM symposium on Theory of computing, pages 209–213, 1979.
- [51] Bin Yu. Assouad, Fano, and Le Cam. Festschrift for Lucien Le Cam, pages 423–435, 1996.
- [52] Yuchen Zhang, John C Duchi, Michael I Jordan, and Martin J Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In NIPS, pages 2328–2336, 2013.
- [53] Yuancheng Zhu and John Lafferty. Distributed nonparametric regression under communication constraints. In International Conference on Machine Learning, pages 6009–6017, 2018.
- [54] Yuancheng Zhu and John Lafferty. Quantized minimax estimation over Sobolev ellipsoids. Information and Inference: A Journal of the IMA, 7(1):31–82, 2018.
Jingbo Liu received the B.S. degree in Electrical Engineering from Tsinghua University, Beijing, China in 2012, and the M.A. and Ph.D. degrees in Electrical Engineering from Princeton University, Princeton, NJ, USA, in 2014 and 2017. He was a Norbert Wiener Postdoctoral Research Fellow at the MIT Institute for Data, Systems, and Society (IDSS) during 2018-2020. Since 2020, he has been an assistant professor in the Department of Statistics and an affiliate in the Department of Electrical and Computer Engineering at the University of Illinois, Urbana-Champaign, IL, USA. His research interests include information theory, high-dimensional statistics and probability, coding theory, and related fields. His undergraduate thesis received the best undergraduate thesis award at Tsinghua University (2012). He gave a semi-plenary presentation at the 2015 IEEE Int. Symposium on Information Theory, Hong Kong, China. He was a recipient of the Princeton University Wallace Memorial Honorific Fellowship in 2016. His Ph.D. thesis received the Bede Liu Best Dissertation Award of Princeton and the Thomas M. Cover Dissertation Award of the IEEE Information Theory Society (2018).