
Agreement Implies Accuracy for Substitutable Signals

Rafael Frongillo, Eric Neyman, Bo Waggoner

Inspired by Aumann’s agreement theorem, [Aar05] studied the amount of communication necessary for two Bayesian experts to approximately agree on the expectation of a random variable. Aaronson showed that, remarkably, the number of bits does not depend on the amount of information available to each expert. However, in general the agreed-upon estimate may be inaccurate: far from the estimate they would settle on if they were to share all of their information. We show that if the experts’ signals are substitutes—meaning the experts’ information has diminishing marginal returns—then whenever the experts are close to agreement, they are close to the truth. We prove this result for a broad class of agreement and accuracy measures that includes squared distance and KL divergence. Additionally, we show that although these measures capture fundamentally different kinds of agreement, Aaronson’s agreement result generalizes to them as well.

1 Introduction

Suppose that Alice and Bob are honest, rational Bayesians who wish to estimate some quantity—say, the unemployment rate one year from now. Alice is an expert on historical macroeconomic trends, while Bob is an expert on contemporary monetary policy. They convene to discuss and share their knowledge with each other until they reach an agreement about the expected value of the future unemployment rate. Alice and Bob could reach agreement by sharing everything they had ever learned, at which point they would have the same information, but the process would take years. How then should they proceed?

In the seminal work “Agreeing to Disagree,” Aumann [Aum76] observed that Alice and Bob can reach agreement simply by taking turns sharing their current expected value for the quantity. In addition to modeling communication between Bayesian agents, protocols similar to this one model financial markets: each trader shares partial information about their expected value on their turn (discussed in Section 5). A remarkable result by Scott Aaronson [Aar05] shows that if Alice and Bob follow certain protocols of this form, they will agree to within $\epsilon$ with probability $1-\delta$ by communicating $O\left(\frac{1}{\delta\epsilon^{2}}\right)$ bits. (To ensure that each message is short, Alice and Bob share discretized versions of their estimates; we discuss this in Section 2.) Notably, this bound only depends on the error Alice and Bob are willing to tolerate, and not on the amount of information available to them.

Absent from Aaronson’s results, however, is what Alice and Bob agree on. In particular, there is no guarantee that Alice and Bob will be accurate, meaning their agreed-upon estimate will be close (in e.g. expected squared distance) to what they would believe if they shared all of their information. In fact, they might agree on something highly inaccurate: suppose that Alice and Bob have independent, uniformly random bits $b_{A},b_{B}$, and wish to estimate the XOR $b_{A}\oplus b_{B}$. Alice and Bob agree from the onset, as from each of their perspectives, the expected value of $b_{A}\oplus b_{B}$ is $\frac{1}{2}$. Yet this expectation is far from the best estimate given their collective knowledge, which is either $0$ or $1$. So while agreement is fundamental to understanding communication between Bayesians—in Aumann’s terms, they cannot “agree to disagree”—agreement is far from the whole story. An important open problem is therefore what assumptions guarantee that Alice and Bob are accurate once they agree.
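
The XOR example can be checked mechanically. The following sketch (in Python, with variable names of our own choosing) enumerates the four equally likely signal pairs and confirms that both experts’ conditional expectations are always $\frac{1}{2}$ while the estimate given both signals is $0$ or $1$.

```python
from itertools import product

# A sketch of the XOR example: states are pairs of independent uniform bits (b_A, b_B)
# and Y is their XOR. Both experts' conditional expectations are always 1/2, while the
# estimate given both signals is 0 or 1.
states = list(product([0, 1], repeat=2))             # each state has probability 1/4
Y = {(bA, bB): bA ^ bB for (bA, bB) in states}

def cond_exp(known):
    """E[Y | the signals listed in `known`], where known maps index -> observed bit."""
    consistent = [s for s in states if all(s[i] == v for i, v in known.items())]
    return sum(Y[s] for s in consistent) / len(consistent)

for bA, bB in states:
    alice = cond_exp({0: bA})           # E[Y | b_A]       = 1/2
    bob = cond_exp({1: bB})             # E[Y | b_B]       = 1/2
    truth = cond_exp({0: bA, 1: bB})    # E[Y | b_A, b_B]  = 0 or 1
    print(bA, bB, alice, bob, truth)
```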

We address this open problem by introducing a natural condition, called rectangle substitutes, under which agreement implies accuracy. Rectangle substitutes is a notion of informational substitutes: the property that additional information has diminishing marginal returns. The notion of substitutes is ubiquitous in optimization problems, and informational substitutes conditions have recently been used to analyze equilibria in markets [CW16]. We show that under the rectangle substitutes condition, any protocol leading to agreement will also lead to accuracy. We then extend these results beyond the case of squared error, to a broad family of measures of agreement and accuracy including KL divergence.

1.1 Overview of approach and results

In [Aar05], Alice and Bob are said to agree if the squared distance between their estimates is small. Likewise, we can say that Alice and Bob are accurate if the squared distance between each of their estimates and the truth is small. In Section 3 we present our first main result: under these definitions, if the information structure satisfies rectangle substitutes, then agreement implies accuracy. In other words, under this assumption, when two Bayesians agree—regardless of how little information they have shared—they necessarily agree on the truth.

The proof involves carefully partitioning the space of posterior beliefs induced by the protocol. Agreement is used to show that Alice and Bob usually fall into the same partition element, which means that Bob would not learn much from learning the partition element of Alice’s expectation. Then, the rectangle substitutes condition is used to show that if Bob were to learn Alice’s partition element, then he would be very close to knowing the truth.

Aaronson measures agreement in terms of squared error, yet other measurements like KL divergence may be better suited for some settings. For example, if Alice and Bob estimate the probability of a catastrophic event as $10^{-10}$ and $10^{-2}$, respectively, then under squared error they are said to agree closely, but arguably they disagree strongly, as reflected by their large KL divergence. Motivated by these different ways to measure agreement, we next ask:

  1. Can Aaronson’s protocols be generalized to other notions of agreement, such that the number of bits communicated is independent of the amount of information available to Alice and Bob?

  2. Do other notions of agreement necessarily imply accuracy under rectangle substitutes?

In Section 4, we give our second and third main results: the answer to both questions is yes. Specifically, the positive results apply when measuring agreement and accuracy using Bregman divergences, a class of error measures that includes both squared distance and KL divergence. (The third result holds under an “approximate triangle inequality” condition on the Bregman divergence, which is satisfied by most or all natural choices; indeed, it is nontrivial to construct a Bregman divergence that does not satisfy this property.)

Aaronson’s proof of his agreement theorem turns out to be specific to squared distance. Our agreement theorem (Theorem 4.11) modifies Aaronson’s protocol to depend on the particular Bregman divergence, i.e. the relevant error measure. It then proceeds in a manner inspired by Aaronson but using several new ideas. Our proof that agreement implies accuracy under rectangle substitutes for general Bregman divergences also involves some nontrivial changes to our proof for squared distance. In particular, the fact that the length of an interval cannot be inferred from the Bregman divergence between its endpoints necessitates a closer analysis of the partition of Alice’s and Bob’s beliefs.

We conclude in Section 5 with a discussion of connections between agreement protocols and information revelation in financial markets, and discuss an interesting potential avenue for future work.

1.2 Related Work

Our setting is related to but distinct from communication complexity. In that field (e.g. [RY20]), the goal is for Alice and Bob to correctly compute a function of their inputs while communicating as few bits as possible and using any protocol necessary. By contrast, [Aar05] considered a goal of agreement, not correctness, and focused on specific natural protocols, which he showed achieve this goal in a constant number of bits. Our work focuses on Aaronson’s setting. We discuss how our results might be framed in terms of communication complexity in Appendix E.

Our introduction of the substitutes condition is inspired by its usefulness in prediction markets [CW16]. The “expectation-sharing” agreement protocols we study bear a strong similarity to dynamics of market prices. [Ost12] introduced a condition under which convergence of prices in a market implies that all information is aggregated. This can be viewed as an “agreement implies accuracy” condition. We discuss these works and the connection of our work to markets in Section 5. Another similar definition of informational substitutes is used by [NR21] in the context of robust aggregation of forecasts.

Finally, we note that the “agreement protocols” we study are not related to key agreement protocols in cryptography, where the goal is for two communicating parties to jointly construct a shared string for cryptographic use.

2 Preliminaries

2.1 Information Structures

We consider a set $\Omega$ of states of the world, with a probability distribution $\mathbb{P}$ over the world states. There are two experts, Alice and Bob. Alice learns the value of a random variable $\sigma:\Omega\to\mathcal{S}$; we call $\sigma$ Alice’s signal and $\mathcal{S}$ her signal set. Correspondingly, Bob learns the value of a random variable $\tau:\Omega\to\mathcal{T}$. These signals each convey partial information about the true state $\omega\in\Omega$. Alice and Bob are interested in a third random variable $Y:\Omega\to[0,1]$. We use the term information structure to refer to the tuple $\mathcal{I}:=(\Omega,\mathbb{P},\mathcal{S},\mathcal{T},Y)$.

We denote by $\mu_{\sigma\tau}:=\mathbb{E}\left[Y\mid\sigma,\tau\right]$ the random variable that is equal to the expected value of $Y$ conditioned on both Alice’s signal $\sigma$ and Bob’s signal $\tau$. We also define $\mu_{\sigma}:=\mathbb{E}\left[Y\mid\sigma\right]$ and $\mu_{\tau}:=\mathbb{E}\left[Y\mid\tau\right]$. For a measurable set $S\subseteq\mathcal{S}$, we define $\mu_{S}:=\mathbb{E}\left[Y\mid\sigma\in S\right]$; we define $\mu_{T}$ analogously for $T\subseteq\mathcal{T}$. Additionally, for $T\subseteq\mathcal{T}$, we define $\mu_{\sigma T}:=\mathbb{E}\left[Y\mid\tau\in T,\sigma\right]$, i.e. the expected value of $Y$ conditioned on the particular value of $\sigma$ and the knowledge that $\tau\in T$. If Alice knows that Bob’s signal belongs to $T$ (and nothing else about his signal), then the expected value of $Y$ conditional on her information is $\mu_{\sigma T}$; we refer to this as Alice’s expectation. Likewise, for $S\subseteq\mathcal{S}$, we define $\mu_{S\tau}:=\mathbb{E}\left[Y\mid\sigma\in S,\tau\right]$. Finally, we define $\mu_{ST}:=\mathbb{E}\left[Y\mid\sigma\in S,\tau\in T\right]$. This is the expectation of a third party who only knows that $\sigma\in S$ and $\tau\in T$.

In general we often wish to take expectations conditioned on $\sigma\in S,\tau\in T$ (for some $S\subseteq\mathcal{S},T\subseteq\mathcal{T}$). We will use the shorthand $\mathbb{E}\left[\cdot\mid S,T\right]$ for $\mathbb{E}\left[\cdot\mid\sigma\in S,\tau\in T\right]$ in such cases.
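
As an illustration of these conditional expectations, the following sketch computes quantities of the form $\mu_{ST}$ on a small hypothetical finite information structure; the tables `P` and `Y` and the helper `mu` are our own, and $\mu_{\sigma}$, $\mu_{\tau}$, and $\mu_{\sigma\tau}$ arise as special cases by taking singleton or full sets of signals.

```python
# A small illustration on a hypothetical finite information structure: P maps a signal
# pair (sigma, tau) to its probability, and Y maps it to E[Y | sigma, tau] on that pair.
P = {("s1", "t1"): 0.25, ("s1", "t2"): 0.25, ("s2", "t1"): 0.25, ("s2", "t2"): 0.25}
Y = {("s1", "t1"): 0.1, ("s1", "t2"): 0.5, ("s2", "t1"): 0.6, ("s2", "t2"): 0.9}

def mu(S, T):
    """mu_{ST} = E[Y | sigma in S, tau in T]."""
    mass = sum(p for (s, t), p in P.items() if s in S and t in T)
    return sum(p * Y[s, t] for (s, t), p in P.items() if s in S and t in T) / mass

print(mu({"s1"}, {"t1", "t2"}))    # mu_sigma       for sigma = "s1"       -> 0.3
print(mu({"s1", "s2"}, {"t1"}))    # mu_tau         for tau   = "t1"       -> 0.35
print(mu({"s1"}, {"t1"}))          # mu_{sigma tau} for the pair (s1, t1)  -> 0.1
```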

2.2 Agreement Protocols

The notion of agreement between Alice and Bob is central to our work. We first define agreement in terms of squared error, and generalize to other error measures in Section 4.

Definition 2.1 ($\epsilon$-agree).

Let $a$ and $b$ be Alice’s and Bob’s expectations, respectively ($a$ and $b$ are random variables on $\Omega$). Alice and Bob $\epsilon$-agree if $\frac{1}{4}\mathbb{E}\left[(a-b)^{2}\right]\leq\epsilon$.

The constant $\frac{1}{4}$ makes the left-hand side equal the expected squared distance from each of Alice’s and Bob’s expectations to the average of the two.

Our setting follows [Aar05], which examined communication protocols that cause Alice and Bob to agree. In a (deterministic) communication protocol, Alice and Bob take turns sending each other messages. On Alice’s turns, Alice communicates a message that is a deterministic function of her input (i.e. her signal $\sigma$) and all previous communication, and likewise for Bob on his turns. A rectangle is a set of the form $S\times T$ where $S\subseteq\mathcal{S}$ and $T\subseteq\mathcal{T}$. The transcript of the protocol at a time step $t$ (i.e. after $t$ messages have been sent) partitions $\Omega$ into rectangles: for any given sequence of $t$ messages, there are subsets $S_{t}\subseteq\mathcal{S},T_{t}\subseteq\mathcal{T}$ such that the protocol transcript at time $t$ is equal to this sequence if and only if $(\sigma,\tau)\in S_{t}\times T_{t}$. For a given communication protocol, we may think of $S_{t}$ and $T_{t}$ as random variables. Alice’s expectation at time $t$ (i.e. after the $t$-th message has been sent) is $\mu_{\sigma T_{t}}$ and Bob’s expectation at time $t$ is $\mu_{S_{t}\tau}$. Finally, the protocol terminates at a certain time (which need not be known in advance of the protocol). While typically in communication complexity a protocol is associated with a final output, in this case we are interested in Alice’s and Bob’s expectations, so we do not require an output.

It will be convenient to hypothesize a third party observer, whom we call Charlie, who observes the protocol but has no other information. At time $t$, Charlie has expectation $\mu_{S_{t}T_{t}}$. Charlie’s expectation can also be interpreted as the expectation of $Y$ according to Alice and Bob’s common knowledge.

The following definition formalizes the relationship between communication protocols and agreement.

Definition 2.2 ($\epsilon$-agreement protocol).

Given an information structure $\mathcal{I}$, a communication protocol causes Alice and Bob to $\epsilon$-agree on $\mathcal{I}$ if Alice and Bob $\epsilon$-agree at the end of the protocol, i.e., if $\frac{1}{4}\mathbb{E}\left[(a-b)^{2}\right]\leq\epsilon$, where the expected value is over Alice’s and Bob’s inputs. We say that a communication protocol is an $\epsilon$-agreement protocol if the protocol causes Alice and Bob to $\epsilon$-agree on every information structure.

Aaronson defines and analyzes two $\epsilon$-agreement protocols. (A minor difference to our framing is that [Aar05] focuses on probable approximate agreement: protocols that cause the absolute difference between Alice’s and Bob’s expectations to be at most $\epsilon$ with probability at least $1-\delta$. While the results as presented in this section are stronger than those in [Aar05]—the original results follow from these as a consequence of Markov’s inequality—they follow from a straightforward modification of his proofs.) The first of these is the standard protocol, in which Alice and Bob take turns stating their expectations for a number of time steps that can be computed by Alice and Bob independently in advance of the protocol, and which is guaranteed to be at most $O(1/\epsilon)$.

The fact that exchanging their expectations for $O(1/\epsilon)$ time steps results in $\epsilon$-agreement is profound and compelling. However, the standard protocol may require an unbounded number of bits of communication, since Alice and Bob are exchanging real numbers. To address this, Aaronson defines another agreement protocol that is truly polynomial-communication (which we slightly modify for our purposes):

Definition 2.3 (Discretized protocol, [Aar05]).

Choose $\epsilon>0$. In the discretized protocol with parameter $\epsilon$, on her turn (at time $t$), Alice sends “low” if her expectation is smaller than Charlie’s by more than $\epsilon/4$, i.e. if $\mu_{\sigma T_{t-1}}<\mu_{S_{t-1}T_{t-1}}-\epsilon/4$; “high” if her expectation is larger than Charlie’s by more than $\epsilon/4$; and “medium” otherwise. Bob acts analogously on his turn. At the start of the protocol, Alice and Bob use the information structure to independently compute the time $t_{\text{end}}\leq\frac{1000}{\epsilon}$ that minimizes $\mathbb{E}\left[(\mu_{\sigma T_{t_{\text{end}}}}-\mu_{S_{t_{\text{end}}}\tau})^{2}\right]$. The protocol ends at this time.
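
The message rule in Definition 2.3 can be phrased as a small function; the sketch below uses hypothetical argument names, with `alice_expectation` standing in for $\mu_{\sigma T_{t-1}}$ and `charlie_expectation` for $\mu_{S_{t-1}T_{t-1}}$, both of which would be computed from the information structure.

```python
# A sketch of Alice's message rule in the discretized protocol (Definition 2.3).
def alice_message(alice_expectation: float, charlie_expectation: float, eps: float) -> str:
    if alice_expectation < charlie_expectation - eps / 4:
        return "low"
    if alice_expectation > charlie_expectation + eps / 4:
        return "high"
    return "medium"
```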

Theorem 2.4 ([Aar05, Theorem 4]).

The discretized protocol with parameter $\epsilon$ is an $\epsilon$-agreement protocol with transcript length $O(1/\epsilon)$ bits.

In general, we refer to Aaronson’s standard and discretized protocols as examples of expectation-sharing protocols. We will define other examples in Section 4, similar to Aaronson’s discretized protocol but with different cutoffs for low, medium, and high. We also interpret expectation-sharing protocols in the context of markets in Section 5.

2.3 Accuracy and Informational Substitutes

Most of our main results give conditions such that if Alice and Bob $\epsilon$-agree, then Alice’s and Bob’s estimates are accurate. By accurate, we mean that Alice’s and Bob’s expectations are close to $\mu_{\sigma\tau}$, i.e., what they would believe if they knew each other’s signals. (After all, they cannot hope to have a better estimate of $Y$ than $\mu_{\sigma\tau}$; for this reason we sometimes refer to $\mu_{\sigma\tau}$ as the “truth.”) Formally:

Definition 2.5 ($\epsilon$-accurate).

Let $a$ be Alice’s expectation. Alice is $\epsilon$-accurate if $\mathbb{E}\left[(\mu_{\sigma\tau}-a)^{2}\right]\leq\epsilon$. We define $\epsilon$-accuracy analogously for Bob.

One cannot hope for an unconditional result stating that if Alice and Bob agree, then they are accurate. Consider for instance the XOR information structure from the introduction: Alice and Bob each receive independent random bits as input, and $Y$ is the XOR of these bits. Then from the start Alice and Bob agree that the expected value of $Y$ is exactly $\frac{1}{2}$, but this value is far from $\mu_{\sigma\tau}$, which is either $0$ or $1$.

Intuitively, this situation arises because Alice’s and Bob’s signals are informational complements: each signal is not informative by itself, but they are informative when taken together. On the other hand, we say that signals are informational substitutes if learning one signal is less valuable if you already know the other signal. An extreme example is if $\sigma=\tau=X$ for any random variable $X$. Here $\sigma$ becomes useless upon learning $\tau$ and vice versa. In [CW16] (we recommend the arXiv version for the most up-to-date introduction to informational substitutes), the authors discuss formalizations of several notions of informational substitutes. All of these notions capture “diminishing marginal value,” in the sense that, roughly speaking, the value of partial information is a submodular set function. The various definitions proposed by [CW16] only differ in how finely they allow decomposing $\sigma$ and $\tau$ to obtain a marginal unit. Our definition has the same format, but uses a decomposition inspired by information rectangles in communication complexity. Recall that we write $\mid S,T$ as shorthand for $\mid\sigma\in S,\tau\in T$.

Definition 2.6.

An information structure $\mathcal{I}=(\Omega,\mathbb{P},\mathcal{S},\mathcal{T},Y)$ satisfies rectangle substitutes if it satisfies weak substitutes on every sub-rectangle, i.e., if for every $S\subseteq\mathcal{S},T\subseteq\mathcal{T}$, we have

\[
\mathbb{E}\left[(Y-\mu_{S\tau})^{2}\mid S,T\right]-\mathbb{E}\left[(Y-\mu_{\sigma\tau})^{2}\mid S,T\right]\leq\mathbb{E}\left[(Y-\mu_{ST})^{2}\mid S,T\right]-\mathbb{E}\left[(Y-\mu_{\sigma T})^{2}\mid S,T\right].
\]

We will show that under rectangle substitutes, if Alice and Bob approximately agree, then they are approximately accurate.

Interpreting substitutes.

Both sides of the inequality in Definition 2.6 represent the “value” of learning $\sigma$ as measured by a decrease in error. The left-hand side gives the decrease if one already knows $\tau$ and that $\sigma\in S$; the right-hand side gives the decrease if one only knows that $\sigma\in S,\tau\in T$. Substitutes thus says: the marginal value of learning $\sigma$ is smaller if one already knows $\tau$ than if one does not. This statement should hold for every sub-rectangle $S,T$. We remark that the inequality can be rearranged to focus instead on the marginal value of $\tau$ rather than $\sigma$. We also note that in the XOR information structure, the left-hand side of the inequality is $\frac{1}{4}$ while the right-hand side is zero: a large violation of the substitutes condition. In the example $\sigma=\tau=X$, the left side is always zero.
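
To make the violation concrete, the following sketch checks the inequality of Definition 2.6 over all sub-rectangles of the XOR structure; the helper names (`P`, `Y`, `mu`, `err`, `subsets`) are our own, and the only violated rectangle is the full one, with left-hand side $\frac{1}{4}$ and right-hand side $0$ as noted above.

```python
from itertools import combinations, product

# Check the rectangle substitutes inequality on every sub-rectangle of the XOR example.
P = {(a, b): 0.25 for a, b in product([0, 1], repeat=2)}   # uniform signal pairs
Y = {(a, b): a ^ b for a, b in product([0, 1], repeat=2)}  # Y = XOR of the signals

def mu(S, T):
    """mu_{ST} = E[Y | sigma in S, tau in T]."""
    mass = sum(p for (s, t), p in P.items() if s in S and t in T)
    return sum(p * Y[s, t] for (s, t), p in P.items() if s in S and t in T) / mass

def err(estimate, S, T):
    """E[(Y - estimate(sigma, tau))^2 | sigma in S, tau in T]."""
    mass = sum(p for (s, t), p in P.items() if s in S and t in T)
    return sum(p * (Y[s, t] - estimate(s, t)) ** 2
               for (s, t), p in P.items() if s in S and t in T) / mass

def subsets(X):
    return [set(c) for r in range(1, len(X) + 1) for c in combinations(X, r)]

for S, T in product(subsets({0, 1}), subsets({0, 1})):
    lhs = err(lambda s, t: mu(S, {t}), S, T) - err(lambda s, t: mu({s}, {t}), S, T)
    rhs = err(lambda s, t: mu(S, T), S, T) - err(lambda s, t: mu({s}, T), S, T)
    if lhs > rhs + 1e-12:
        print(f"violated on S={S}, T={T}: lhs={lhs:.3f}, rhs={rhs:.3f}")
```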

[CW16] discusses three interpretations of substitutes, which motivate it as a natural condition. (1) Each side of the inequality measures an improvement in prediction error, here the squared loss, due to learning $\sigma$. Under substitutes, the improvement is smaller if one already knows $\tau$. (2) Each side measures a decrease in uncertainty due to learning $\sigma$. Under substitutes, $\sigma$ provides less information about $Y$ if one already knows $\tau$. (Here, uncertainty is measured by the variance of one’s belief. Under the KL divergence analogue covered in Section 4.1, uncertainty is measured in bits via Shannon entropy.) (3) Each side measures the decrease in distance of a posterior expectation from the truth when learning $\sigma$. The distance to $Y$ changes less if one already knows $\tau$.

2.4 The Pythagorean Theorem

We will use the following fact throughout. We defer the proof to Appendix C, where we establish a more general version of this statement.

Proposition 2.7 (Pythagorean theorem).

Let $A$ be a random variable, $B=\mathbb{E}\left[A\mid\mathcal{F}\right]$ where $\mathcal{F}$ is a sigma-algebra, and $C$ be a random variable defined on $\mathcal{F}$. Then

\[
\mathbb{E}\left[(A-C)^{2}\right]=\mathbb{E}\left[(A-B)^{2}\right]+\mathbb{E}\left[(B-C)^{2}\right].
\]

We use the phrase Pythagorean theorem in part because of its form, and in part because it is precisely the familiar Pythagorean theorem when the random variables are viewed as points in a Hilbert space with inner product $\langle X,Y\rangle:=\mathbb{E}\left[XY\right]$. (We do not make use of this abstraction in our work, but we refer the interested reader to [Šid57].)

Informally, $A$ is a random variable, $B$ is the expected value of $A$ conditional on some partial information, and $C$ is a random variable that only depends on this information. So the theorem applies when $B$ is a coarse estimate of $A$ and $C$ is at least as coarse as $B$, a scenario that often occurs in our setting.

One application of the Pythagorean theorem in our context takes $A=Y$, $B=\mu_{\sigma\tau}$ (the expected value of $Y$ conditioned on the experts’ signals), and $C=\mu_{\sigma T}$ (Alice’s expected value, which only depends on her signal and thus on the signal pair). This particular application, along with the symmetric one taking $C=\mu_{S\tau}$, allows us to rewrite the rectangle substitutes condition in a form that we will find more convenient:
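
The identity in Proposition 2.7 is also easy to verify numerically. In the following sketch the coarse information is a label in a two-element set, $B$ is estimated as the empirical conditional mean given the label, and $C$ is an arbitrary (hypothetical) function of the label chosen by us; the two sides then match up to floating-point error.

```python
import random

# A numerical sanity check of Proposition 2.7 on a small discrete example.
random.seed(0)
samples = [(random.choice(["x", "y"]), random.random()) for _ in range(20000)]

def mean(vals):
    return sum(vals) / len(vals)

B = {lab: mean([a for l, a in samples if l == lab]) for lab in ["x", "y"]}  # E[A | label]
C = {"x": 0.2, "y": 0.7}   # any function of the label works

lhs = mean([(a - C[l]) ** 2 for l, a in samples])
rhs = mean([(a - B[l]) ** 2 for l, a in samples]) + mean([(B[l] - C[l]) ** 2 for l, a in samples])
print(lhs, rhs)   # agree up to floating-point error, since B is the empirical conditional mean
```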

Remark 2.8.

An information structure $\mathcal{I}$ satisfies rectangle substitutes if and only if

\[
\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S\tau})^{2}\mid S,T\right]\leq\mathbb{E}\left[(\mu_{\sigma T}-\mu_{ST})^{2}\mid S,T\right] \tag{1}
\]

for all $S,T$.

3 Results for Squared Distance

Our main results show that, under the rectangle substitutes condition, any communication protocol that causes Alice and Bob to agree also causes them to be accurate. We now show the first of these results, which is specific to the squared distance error measure that we have been discussing.

3.1 Agreement Implies Accuracy

Theorem 3.1.

Let $\mathcal{I}=(\Omega,\mathbb{P},\mathcal{S},\mathcal{T},Y)$ be an information structure that satisfies rectangle substitutes. For any communication protocol that causes Alice and Bob to $\epsilon$-agree on $\mathcal{I}$, Alice and Bob are $10\epsilon^{1/3}$-accurate after the protocol terminates.

The crux of the argument is the following lemma.

Lemma 3.2.

Let $\mathcal{I}=(\Omega,\mathbb{P},\mathcal{S},\mathcal{T},Y)$ be an information structure that satisfies rectangle substitutes. Let $\epsilon=\mathbb{E}\left[(\mu_{\sigma}-\mu_{\tau})^{2}\right]$. Then

\[
\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{\tau})^{2}\right]\leq 6\epsilon^{1/3}.
\]

Let us first prove Theorem 3.1 assuming Lemma 3.2 is true.

Proof of Theorem 3.1.

Consider any protocol that causes Alice and Bob to $\epsilon$-agree on $\mathcal{I}$. Let $S$ be the set of possible signals of Alice at the end of the protocol which are consistent with the protocol transcript, and define $T$ likewise for Bob. Intuitively, $S\times T$ is the set of plausible signal pairs $(\sigma,\tau)$ according to an external observer of the protocol. Observe that $S$ and $T$ are random variables, each a function of both $\sigma$ and $\tau$. We have

\begin{align*}
\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S\tau})^{2}\right] &=\mathbb{E}_{S,T}\left[\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S\tau})^{2}\mid S,T\right]\right]\\
&\leq\mathbb{E}_{S,T}\left[6\left(\mathbb{E}\left[(\mu_{\sigma T}-\mu_{S\tau})^{2}\mid S,T\right]\right)^{1/3}\right]\\
&\leq 6\,\mathbb{E}_{S,T}\left[\mathbb{E}\left[(\mu_{\sigma T}-\mu_{S\tau})^{2}\mid S,T\right]\right]^{1/3}\\
&=6\,\mathbb{E}\left[(\mu_{\sigma T}-\mu_{S\tau})^{2}\right]^{1/3}=6(4\epsilon)^{1/3}\leq 10\epsilon^{1/3}.
\end{align*}

In the second step, we apply Lemma 3.2 to the information structure $\mathcal{I}$ restricted to $S\times T$—that is, to $\mathcal{I}^{\prime}=(\Omega^{\prime},\mathbb{P}^{\prime},S,T,Y)$, where $\Omega^{\prime}=\{\omega\in\Omega:\sigma\in S,\tau\in T\}$ and $\mathbb{P}^{\prime}[\omega]=\mathbb{P}\left[\omega\mid\sigma\in S,\tau\in T\right]$. (Note that we use the fact that if $\mathcal{I}$ satisfies rectangle substitutes, then so does $\mathcal{I}^{\prime}$; this is because a rectangle of $\mathcal{I}^{\prime}$ is also a rectangle of $\mathcal{I}$.) The third step follows by the concavity of $x^{1/3}$. Therefore, Bob is $10\epsilon^{1/3}$-accurate (and Alice is likewise by symmetry). ∎

The proof of Lemma 3.2 relies on the following claim. We defer the proof of Lemma 3.2 (and Claim 3.3) to Appendix B, and instead sketch the proofs here.

Claim 3.3.

For any $N\geq 1$, it is possible to partition $[0,1]$ into $N$ intervals $[0,x_{1}),[x_{1},x_{2}),\dots,[x_{N-1},1]$ in a way so that each interval has length at most $\frac{2}{N}$, and

\[
\mathbb{P}\left[k(\sigma)\neq k(\tau)\right]\leq\sqrt{\epsilon}\,N,
\]

where $k(\sigma)$ denotes the $k\in[N]$ such that $x_{k-1}\leq\mu_{\sigma}<x_{k}$, and $k(\tau)$ is defined analogously. (For convenience we define $x_{0}=0$ and $x_{N}$ to be some number greater than $1$.)

Intuitively, Claim 3.3 is true because if $\mathbb{E}\left[(\mu_{\sigma}-\mu_{\tau})^{2}\right]$ is small, then $\mu_{\sigma}$ and $\mu_{\tau}$ are likely to fall into the same interval.

We now sketch the proof of Lemma 3.2. To see why Claim 3.3 is relevant, recall that we wish to upper bound the expectation of $(\mu_{\sigma\tau}-\mu_{\tau})^{2}$. Let $S^{(k)}:=\{\sigma\in\mathcal{S}:x_{k-1}\leq\mu_{\sigma}<x_{k}\}$. By the Pythagorean theorem, we have

\[
\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{\tau})^{2}\right]=\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S^{(k(\sigma))}\tau})^{2}\right]+\mathbb{E}\left[(\mu_{S^{(k(\sigma))}\tau}-\mu_{\tau})^{2}\right].
\]

By using the rectangle substitutes condition for $S=S^{(k)},T=\mathcal{T}$ for every $k$, we find that

\[
\mathbb{E}\left[(\mu_{\sigma}-\mu_{S^{(k(\sigma))}})^{2}\right]\geq\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S^{(k(\sigma))}\tau})^{2}\right]. \tag{2}
\]

Therefore, we have

\[
\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{\tau})^{2}\right]\leq\mathbb{E}\left[(\mu_{\sigma}-\mu_{S^{(k(\sigma))}})^{2}\right]+\mathbb{E}\left[(\mu_{S^{(k(\sigma))}\tau}-\mu_{\tau})^{2}\right]. \tag{3}
\]

Claim 3.3 lets us argue that the first of these two terms is small (because $\mu_{\sigma}$ and $\mu_{S^{(k(\sigma))}}$ are always within $\frac{2}{N}$ of each other) and that the second term is also small (because conditioned on $\tau$, $k(\sigma)$ is known with high probability). We find that choosing $N=\epsilon^{-1/6}$ gives us the bound in Lemma 3.2.
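
Concretely, taking $N=\epsilon^{-1/6}$ (rounded to an integer), a rough accounting of the two terms in Equation 3 reads as follows; the constants are indicative only, and the careful version of this argument appears in Appendix B:

\[
\mathbb{E}\left[(\mu_{\sigma}-\mu_{S^{(k(\sigma))}})^{2}\right]\leq\left(\frac{2}{N}\right)^{2}=4\epsilon^{1/3},\qquad\mathbb{E}\left[(\mu_{S^{(k(\sigma))}\tau}-\mu_{\tau})^{2}\right]\lesssim\mathbb{P}\left[k(\sigma)\neq k(\tau)\right]\leq\sqrt{\epsilon}\,N=\epsilon^{1/3},
\]

so the total is $O(\epsilon^{1/3})$, consistent with the $6\epsilon^{1/3}$ bound of Lemma 3.2.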

Theorem 3.1 is a general result about agreement protocols. Applying the result to Aaronson’s discretized protocol gives us the following result.

Corollary 3.4.

Let $\mathcal{I}$ be any information structure that satisfies rectangle substitutes. For any $\epsilon>0$, Alice and Bob will be $\epsilon$-accurate after running Aaronson’s discretized protocol with parameter $\epsilon^{3}/1000$ (and this takes $O(1/\epsilon^{3})$ bits of communication).

Remark 3.5.

The discretized protocol is not always the most efficient agreement protocol. For example, Proposition B.1 shows that if the rectangle substitutes condition holds, agreement (and therefore accuracy) can be reached with just $O(\log(1/\epsilon))$ bits, an improvement on Corollary 3.4. We discuss communication complexity further in Appendix E. Even if more efficient protocols are sometimes possible, expectation-sharing protocols are of interest because they model naturally-occurring communication processes. For example, they capture the dynamics of prices in markets, which we also discuss in Section 5. More generally, we find it remarkable that Alice and Bob become accurate by running the agreement protocol (indeed any agreement protocol), despite such protocols being designed with only agreement in mind.

Finally, we observe the following important consequence of Theorem 3.1: once Alice and Bob agree, they continue to agree.

Corollary 3.6.

Let $\mathcal{I}=(\Omega,\mathbb{P},\mathcal{S},\mathcal{T},Y)$ be an information structure that satisfies rectangle substitutes. Consider a communication protocol with the property that Alice and Bob $\epsilon$-agree after round $t$. Then Alice and Bob $10\epsilon^{1/3}$-agree at all subsequent time steps.

Proof.

If Alice and Bob $\epsilon$-agree then they are $10\epsilon^{1/3}$-accurate, so in particular $\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{\sigma T_{t}})^{2}\right]\leq 10\epsilon^{1/3}$. Note that $\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{\sigma T_{s}})^{2}\right]$ is a decreasing function of $s$, since for any $s_{1}\leq s_{2}$ we have

\[
\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{\sigma T_{s_{1}}})^{2}\right]=\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{\sigma T_{s_{2}}})^{2}\right]+\mathbb{E}\left[(\mu_{\sigma T_{s_{2}}}-\mu_{\sigma T_{s_{1}}})^{2}\right]
\]

by the Pythagorean theorem. Therefore, for any $t^{\prime}>t$, we have $\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{\sigma T_{t^{\prime}}})^{2}\right]\leq 10\epsilon^{1/3}$. Symmetrically, we have $\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S_{t^{\prime}}\tau})^{2}\right]\leq 10\epsilon^{1/3}$. Therefore, $\mathbb{E}\left[(\mu_{\sigma T_{t^{\prime}}}-\mu_{S_{t^{\prime}}\tau})^{2}\right]\leq 40\epsilon^{1/3}$, which means that after round $t^{\prime}$, Alice and Bob $10\epsilon^{1/3}$-agree. ∎

Corollary 3.6 stands in contrast to the more general case, in which it is possible that Alice and Bob “nearly agree for the first $t-1$ time steps, then disagree violently at the $t$-th step” [Aar05, §2.2]. Thus, while the main purpose of Theorem 3.1 is a property about accuracy, an agreement property falls out naturally: under the rectangle substitutes condition, once Alice and Bob are close to agreement, they will remain in relatively close agreement into the future.

3.2 Graceful Decay Under Closeness to Rectangle Substitutes

In a sense, the rectangle substitutes condition is quite strong: it requires that the weak substitutes condition be satisfied on every sub-rectangle. One might hope for a result that generalizes Theorem 3.1 to information structures that almost satisfy the rectangle substitutes condition but do not quite. Let us formally define a notion of closeness to rectangle substitutes.

Definition 3.7.

An information structure $\mathcal{I}=(\Omega,\mathbb{P},\mathcal{S},\mathcal{T},Y)$ satisfies $\delta$-approximate rectangle substitutes if for every partition of $\mathcal{S}\times\mathcal{T}$ into rectangles, the rectangle substitutes condition holds in expectation over the partition, up to an additive constant of $\delta$, i.e., if we have

\[
\mathbb{E}_{\sigma,\tau}\left[(\mu_{\sigma\tau}-\mu_{S_{\sigma,\tau}\tau})^{2}\right]\leq\mathbb{E}_{\sigma,\tau}\left[(\mu_{\sigma T_{\sigma,\tau}}-\mu_{S_{\sigma,\tau}T_{\sigma,\tau}})^{2}\right]+\delta, \tag{4}
\]

where $S_{\sigma,\tau}\times T_{\sigma,\tau}$ is the rectangle containing $(\sigma,\tau)$. (There are partitions into rectangles that cannot arise from a communication protocol. Our results would apply equally if this condition were instead defined for every partition that could arise from a communication protocol, but we state this condition more generally so that it could be applicable in a broader context than the analysis of communication protocols.)

Remark 3.8.

The $\delta$-approximate rectangle substitutes property is a relaxation of the rectangle substitutes property, in the sense that the two are equivalent if $\delta=0$. To see this, first observe that if $\mathcal{I}$ satisfies rectangle substitutes, then it satisfies Equation 4 with $\delta=0$ pointwise across all $S_{\sigma,\tau},T_{\sigma,\tau}$, and thus in expectation. In the other direction, suppose that $\mathcal{I}$ satisfies $0$-approximate rectangle substitutes. Let $S\subseteq\mathcal{S},T\subseteq\mathcal{T}$ and consider the partition of $\mathcal{S}\times\mathcal{T}$ into rectangles that contains $S\times T$ and, separately, every other signal pair $(\sigma,\tau)$ in its own rectangle. For this partition, Equation 4 reduces precisely to Equation 1 (the rectangle substitutes condition for $S$ and $T$).

Theorem 3.1 generalizes to approximate rectangle substitutes as follows.

Theorem 3.9.

Let $\mathcal{I}=(\Omega,\mathbb{P},\mathcal{S},\mathcal{T},Y)$ be an information structure that satisfies $\delta$-approximate rectangle substitutes. For any communication protocol that causes Alice and Bob to $\epsilon$-agree on $\mathcal{I}$, Alice and Bob are $(10\epsilon^{1/3}+\delta)$-accurate after the protocol terminates.

Proof.

We first observe that Lemma 3.2 can be modified as follows.

Lemma 3.10.

Let $\mathcal{I}=(\Omega,\mathbb{P},\mathcal{S},\mathcal{T},Y)$ be an information structure that satisfies $\delta$-approximate rectangle substitutes. Let $\epsilon=\mathbb{E}\left[(\mu_{\sigma}-\mu_{\tau})^{2}\right]$. Then

\[
\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{\tau})^{2}\right]\leq 6\epsilon^{1/3}+\delta.
\]

The proof of Lemma 3.10 is exactly the same as that of Lemma 3.2, except that Equation 2 (Equation 6 in the full proof) includes an additive $\delta$ term on the left-hand side:

\[
\mathbb{E}\left[(\mu_{\sigma}-\mu_{S^{(k(\sigma))}})^{2}\right]+\delta\geq\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S^{(k(\sigma))}\tau})^{2}\right].
\]

This modified inequality follows immediately from the $\delta$-approximate rectangle substitutes condition, noting that one partition of $\mathcal{S}\times\mathcal{T}$ into rectangles is $\{S^{(1)}\times\mathcal{T},\dots,S^{(N)}\times\mathcal{T}\}$. The extra $\delta$ term produces the $\delta$ term in the lemma statement.

To prove the theorem, let $S$ be the set of possible signals of Alice at the end of the protocol which are consistent with the protocol transcript, and define $T$ likewise for Bob. Let $\delta_{ST}$ be the minimum $\delta$ such that $S\times T$ satisfies $\delta$-approximate rectangle substitutes. Note that $\mathbb{E}_{S,T}\left[\delta_{ST}\right]\leq\delta$: otherwise, by taking the union over the worst-case partitions for each $S,T$ we would exhibit a partition of $\mathcal{S}\times\mathcal{T}$ into rectangles that would violate the $\delta$-approximate rectangle substitutes property. Therefore we have

\begin{align*}
\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S\tau})^{2}\right] &=\mathbb{E}_{S,T}\left[\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S\tau})^{2}\mid S,T\right]\right]\\
&\leq\mathbb{E}_{S,T}\left[6\left(\mathbb{E}\left[(\mu_{\sigma T}-\mu_{S\tau})^{2}\mid S,T\right]\right)^{1/3}+\delta_{ST}\right]\\
&\leq 6\,\mathbb{E}_{S,T}\left[\mathbb{E}\left[(\mu_{\sigma T}-\mu_{S\tau})^{2}\mid S,T\right]\right]^{1/3}+\delta\\
&=6\,\mathbb{E}\left[(\mu_{\sigma T}-\mu_{S\tau})^{2}\right]^{1/3}+\delta=6(4\epsilon)^{1/3}+\delta\leq 10\epsilon^{1/3}+\delta.
\end{align*}

As in the proof of Theorem 3.1, the second step follows by applying Lemma 3.10 to the information structure $\mathcal{I}$ restricted to $S\times T$. ∎

4 Results for Other Divergence Measures

Squared distance is a compelling error measure because it elicits the mean. That is, if you wish to estimate a random variable $Y$ and will be penalized according to the squared distance between $Y$ and your estimate, the strategy that minimizes your expected penalty is to report the expected value of $Y$ (conditional on the information you have). This is in contrast to e.g. absolute distance as an error measure, which would instead elicit the median of your distribution. The class of error measures that elicit the mean is precisely the class of Bregman divergences (defined below).

In this section, our main result is a generalization of Theorem 3.1 to (almost) arbitrary Bregman divergences (see e.g. Theorem 4.14). Additionally, we provide a generalization of Aaronson’s discretized protocol to arbitrary Bregman divergences (Theorem 4.11).

4.1 Preliminaries on Bregman Divergences

Definition 4.1.

Given a differentiable, strictly convex function $G:[0,1]\to\mathbb{R}$ (when we say “differentiable,” we mean differentiable on the interior of the interval on which $G$ is defined), and $x,y\in[0,1]$, the Bregman divergence from $y$ to $x$ is

\[
D_{G}(y\parallel x):=G(y)-G(x)-(y-x)G^{\prime}(x).
\]
Figure 1: The Bregman divergence $D_{G}(y\parallel x)$ is the vertical distance at $y$ between $G$ and the tangent line to $G$ at $x$.
Proposition 4.2 ([Ban+05]).

Given a random variable $Y$, the quantity $\mathbb{E}\left[D_{G}(Y\parallel x)\right]$ is minimized by $x=\mathbb{E}\left[Y\right]$.

An intuitive formulation of Bregman divergence is that $D_{G}(y\parallel x)$ can be found by drawing the line tangent to $G$ at $x$ and computing how far below the point $(y,G(y))$ this line passes. We illustrate this in Figure 1. Note that the Bregman divergence is not in general symmetric in its arguments; indeed, $G(x)=x^{2}$ is the only $G$ for which it is.

The Bregman divergence with respect to $G(x)=x^{2}$ is precisely the squared distance. Another common Bregman divergence is the KL divergence, which corresponds to $G(x)=x\log x+(1-x)\log(1-x)$, the negative of Shannon entropy.
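
Both examples can be computed directly from Definition 4.1; in the following sketch the function names are our own and the derivatives are hard-coded for clarity.

```python
import math

# D_G(y || x) for G(x) = x^2 (squared distance) and for G = negative Shannon entropy
# (binary KL divergence between Bernoulli distributions with the given parameters).
def bregman(G, Gprime, y, x):
    return G(y) - G(x) - (y - x) * Gprime(x)

def G_sq(x):
    return x ** 2

def G_sq_prime(x):
    return 2 * x

def G_ent(x):
    return x * math.log(x) + (1 - x) * math.log(1 - x)

def G_ent_prime(x):
    return math.log(x) - math.log(1 - x)

print(bregman(G_sq, G_sq_prime, 0.9, 0.4))    # 0.25 = (0.9 - 0.4)^2
print(bregman(G_ent, G_ent_prime, 0.9, 0.4))  # KL(Bernoulli(0.9) || Bernoulli(0.4))
```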

We generalize relevant notions such as agreement and accuracy to arbitrary Bregman divergences as follows. In the definitions below, $G:[0,1]\to\mathbb{R}$ is a differentiable, strictly convex function.

Definition 4.3.

Let $a$ be Alice’s expectation. Alice is $\epsilon$-accurate if $\mathbb{E}\left[D_{G}(\mu_{\sigma\tau}\parallel a)\right]\leq\epsilon$, and likewise for Bob.

We discuss our choice of the order of these two arguments (i.e. why we do not instead consider the expectation of $D_{G}(a\parallel\mu_{\sigma\tau})$) in Appendix D. We now define $\epsilon$-agreement, and to do so we first define the Jensen-Bregman divergence.

Definition 4.4.

For $a,b\in[0,1]$, the Jensen-Bregman divergence between $a$ and $b$ with respect to $G$ is

\[
\text{JB}_{G}(a,b):=\frac{1}{2}\left(D_{G}\left(a\parallel\frac{a+b}{2}\right)+D_{G}\left(b\parallel\frac{a+b}{2}\right)\right)=\frac{G(a)+G(b)}{2}-G\left(\frac{a+b}{2}\right).
\]

The validity of the second equality can be easily derived from the definition of Bregman divergence. Note that the Jensen-Bregman divergence, unlike the Bregman divergence, is symmetric in its arguments. The Jensen-Bregman divergence is a lower bound on the average Bregman divergence from Alice and Bob to any other point (see Proposition C.1 (i)).
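
The second form makes the Jensen-Bregman divergence a one-line computation; as a sanity check (with a function name of our own choosing), $G(x)=x^{2}$ recovers the agreement measure of Definition 2.1.

```python
# Jensen-Bregman divergence via the second form in Definition 4.4.
def jensen_bregman(G, a, b):
    return (G(a) + G(b)) / 2 - G((a + b) / 2)

print(jensen_bregman(lambda x: x ** 2, 0.2, 0.8))   # 0.09 = (0.8 - 0.2)^2 / 4
```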

Definition 4.5.

Let $a$ and $b$ be Alice’s and Bob’s expectations, respectively. Alice and Bob $\epsilon$-agree with respect to $G$ if $\mathbb{E}\left[\text{JB}_{G}(a,b)\right]\leq\epsilon$.

In Appendix D we discuss alternative definitions of agreement and accuracy. The upshot of this discussion is that our definition of agreement is the weakest reasonable one, and our definition of accuracy is the strongest reasonable one. This means that the main result of this section—that under a wide class of Bregman divergences, agreement implies accuracy—is quite powerful: it starts with a weak premise and proves a strong conclusion.

Definition 4.6.

Given an information structure $\mathcal{I}$, a communication protocol causes Alice and Bob to $\epsilon$-agree on $\mathcal{I}$ with respect to $G$ if Alice and Bob $\epsilon$-agree with respect to $G$ at the end of the protocol. A communication protocol is an $\epsilon$-agreement protocol with respect to $G$ if the protocol causes Alice and Bob to $\epsilon$-agree with respect to $G$ on every information structure.

We also generalize the notion of rectangle substitutes to this domain, following [CW16], which explored notions of substitutes for arbitrary Bregman divergences.

Definition 4.7.

Let $G:[0,1]\to\mathbb{R}$ be a differentiable, strictly convex function. An information structure $\mathcal{I}=(\Omega,\mathbb{P},\mathcal{S},\mathcal{T},Y)$ satisfies rectangle substitutes with respect to $G$ if for every $S\subseteq\mathcal{S},T\subseteq\mathcal{T}$, we have

\[
\mathbb{E}\left[D_{G}(Y\parallel\mu_{S\tau})\mid S,T\right]-\mathbb{E}\left[D_{G}(Y\parallel\mu_{\sigma\tau})\mid S,T\right]\leq\mathbb{E}\left[D_{G}(Y\parallel\mu_{ST})\mid S,T\right]-\mathbb{E}\left[D_{G}(Y\parallel\mu_{\sigma T})\mid S,T\right].
\]

The Pythagorean theorem (Proposition 2.7) generalizes to arbitrary Bregman divergences:

Proposition 4.8.

Let $A$ be a random variable, $B=\mathbb{E}\left[A\mid\mathcal{F}\right]$ where $\mathcal{F}$ is a sigma-algebra, and $C$ be a random variable defined on $\mathcal{F}$. Then

\[
\mathbb{E}\left[D_{G}(A\parallel C)\right]=\mathbb{E}\left[D_{G}(A\parallel B)\right]+\mathbb{E}\left[D_{G}(B\parallel C)\right].
\]

Although the proof of this observation is fairly straightforward, to our knowledge Proposition 4.8 is original to this work. We provide a proof in Appendix C. Just as we did with squared error, this general Pythagorean theorem allows us to rewrite the rectangle substitutes condition for Bregman divergences.

Remark 4.9.

An information structure $\mathcal{I}$ satisfies rectangle substitutes with respect to $G$ if and only if for all $S\subseteq\mathcal{S},T\subseteq\mathcal{T}$ we have

\[
\mathbb{E}\left[D_{G}(\mu_{\sigma\tau}\parallel\mu_{S\tau})\mid S,T\right]\leq\mathbb{E}\left[D_{G}(\mu_{\sigma T}\parallel\mu_{ST})\mid S,T\right]. \tag{5}
\]

Given the interpretation of Bregman divergences as measures of error, we can interpret the left side as Bob’s expected error in predicting the truth while the right side is Charlie’s expected error when predicting Alice’s expectation. Both sides measure a prediction error due to not having Alice’s signal, but from different starting points.

4.2 Generalizing the Discretized Protocol

Later in this work, we will show that under some weak conditions, protocols that cause Alice and Bob to agree with respect to $G$ also cause Alice and Bob to be accurate with respect to $G$. However, this raises an interesting question: are there protocols that cause Alice and Bob to agree with respect to $G$? In particular, we are interested in natural expectation-sharing protocols. Aaronson’s discretized protocol is specific to $G(x)=x^{2}$, and it is not immediately obvious how to generalize it. We present the following generalization.

Definition 4.10.

Let $G$ be a differentiable, strictly convex function, and let $M:=\max_{x}G(x)-\min_{x}G(x)$. Choose $\epsilon>0$. In the discretized protocol with respect to $G$ with parameter $\epsilon$, on her turn (at time $t$), Alice sends “medium” if $D_{G}(\mu_{\sigma T_{t-1}}\parallel\mu_{S_{t-1}T_{t-1}})<\frac{\epsilon}{2}$, and otherwise either “low” or “high”, depending on whether $\mu_{\sigma T_{t-1}}$ is smaller or larger (respectively) than $\mu_{S_{t-1}T_{t-1}}$. Bob acts analogously on his turn. At the start of the protocol, Alice and Bob use the information structure to independently compute the time $t_{\text{end}}\leq\frac{24M(4M+\epsilon)}{\epsilon^{2}}$ that minimizes $\mathbb{E}\left[D_{G}(\mu_{\sigma T_{t_{\text{end}}}}\parallel\mu_{S_{t_{\text{end}}}\tau})\right]$. The protocol ends at this time.
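
As with Definition 2.3, the message rule can be phrased as a small function; in the sketch below `alice_expectation` and `charlie_expectation` are hypothetical stand-ins for $\mu_{\sigma T_{t-1}}$ and $\mu_{S_{t-1}T_{t-1}}$, and the only change from the squared-distance case is that the “medium” test uses the Bregman divergence with threshold $\epsilon/2$.

```python
# A sketch of Alice's message rule in the discretized protocol with respect to G
# (Definition 4.10).
def bregman(G, Gprime, y, x):
    return G(y) - G(x) - (y - x) * Gprime(x)

def alice_message_G(G, Gprime, alice_expectation, charlie_expectation, eps):
    if bregman(G, Gprime, alice_expectation, charlie_expectation) < eps / 2:
        return "medium"
    return "low" if alice_expectation < charlie_expectation else "high"
```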

Theorem 4.11.

The discretized protocol with respect to $G$ with parameter $\epsilon$ is an $\epsilon$-agreement protocol with respect to $G$ that involves $O\left(\frac{M(M+\epsilon)}{\epsilon^{2}}\right)$ bits of communication.

Our proof draws inspiration from Aaronson’s proof of the discretized protocol, but has significant differences. The key idea is to keep track of the monovariant $\mathbb{E}\left[D_{G}(Y\parallel\mu_{S_{t}T_{t}})\right]$. This is Charlie’s expected error (as measured by the Bregman divergence from the correct answer $Y$) after time step $t$—recall that Charlie is our name for a third-party observer of the protocol. Note that this quantity is at most $M$ and at least $0$. Hence, if we show that the quantity decreases by at least some value $\beta$ every time Alice and Bob do not $\epsilon$-agree, then we will have shown that Alice and Bob must $\epsilon$-agree within $\frac{M}{\beta}$ time steps. We defer the proof to Appendix C.

4.3 Approximate Triangle Inequality

Our results will hold for a class of Jensen-Bregman divergences that satisfy an approximate version of the triangle inequality. Specifically, we will require $\text{JB}_{G}$ to satisfy the following $c$-approximate triangle inequality for some $c>0$.

Definition 4.12.

Given a differentiable, strictly convex function $G:[0,1]\to\mathbb{R}$ and a positive number $c$, we say that $\text{JB}_{G}(\cdot,\cdot)$ satisfies the $c$-approximate triangle inequality if for all $a,b,x\in[0,1]$ we have

\[
\text{JB}_{G}(a,x)+\text{JB}_{G}(x,b)\geq c\,\text{JB}_{G}(a,b).
\]

It is possible to construct functions $G$ such that there is no positive $c$ for which $\text{JB}_{G}$ satisfies the $c$-approximate triangle inequality. However, $\text{JB}_{G}$ satisfies the $c$-approximate triangle inequality for some positive $c$ for essentially all natural choices of $G$.

Proposition 4.13.

Let $G:[0,1]\to\mathbb{R}$ be a differentiable, strictly convex function.

  (i) If $\sqrt{\text{JB}_{G}(\cdot,\cdot)}$ satisfies the triangle inequality, then $\text{JB}_{G}$ satisfies the $\frac{1}{2}$-approximate triangle inequality.

  (ii) If $G(x)=x^{2}$ (i.e. $D_{G}$ is squared distance) or if $G(x)=x\log x+(1-x)\log(1-x)$ (i.e. $D_{G}$ is KL divergence), then $\sqrt{\text{JB}_{G}}$ satisfies the triangle inequality (and so $\text{JB}_{G}$ satisfies the $\frac{1}{2}$-approximate triangle inequality).

Proof.

Regarding Fact (i), suppose that $\sqrt{\text{JB}_{G}}$ satisfies the triangle inequality. Then for all $a,b,x$ we have $\sqrt{\text{JB}_{G}(a,x)}+\sqrt{\text{JB}_{G}(x,b)}\geq\sqrt{\text{JB}_{G}(a,b)}$. Squaring both sides and observing that $\text{JB}_{G}(a,x)+\text{JB}_{G}(x,b)\geq 2\sqrt{\text{JB}_{G}(a,x)\text{JB}_{G}(x,b)}$ completes the proof.

Fact (ii) is trivial for $G(x)=x^{2}$, since $\sqrt{\text{JB}_{G}}$ is the absolute distance metric (times a constant factor). As for $G(x)=x\log x+(1-x)\log(1-x)$, we refer the reader to [ES03]. ∎

The question of when $\sqrt{\text{JB}_{G}}$ satisfies the triangle inequality has been explored in previous work; we refer the interested reader to [ABB13] and [CCR08].
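
As a numerical (not rigorous) spot-check of Proposition 4.13(ii) for the negative-entropy $G$, the following sketch verifies the triangle inequality for $\sqrt{\text{JB}_{G}}$ on a grid of interior points; the grid and tolerance are arbitrary choices of ours.

```python
import math
from itertools import product

# Spot-check: sqrt(JB_G) should satisfy the triangle inequality for the negative-entropy G.
# We stay away from the endpoints 0 and 1, where G is defined by taking limits.
def G(x):
    return x * math.log(x) + (1 - x) * math.log(1 - x)

def jb(a, b):
    return (G(a) + G(b)) / 2 - G((a + b) / 2)

grid = [i / 50 for i in range(1, 50)]
violations = [(a, b, x) for a, b, x in product(grid, repeat=3)
              if math.sqrt(max(jb(a, x), 0.0)) + math.sqrt(max(jb(x, b), 0.0))
              < math.sqrt(max(jb(a, b), 0.0)) - 1e-12]
print(len(violations))   # expected output: 0
```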

4.4 Generalized Agreement Implies Generalized Accuracy

In all of the results in this subsection, we consider the following setting: $G$ is a differentiable convex function; $c$ is a positive real number such that $\text{JB}_{G}$ satisfies the $c$-approximate triangle inequality; and $\mathcal{I}=(\Omega,\mathbb{P},\mathcal{S},\mathcal{T},Y)$ is an information structure that satisfies rectangle substitutes with respect to $G$.

We prove generalizations of Theorem 3.1, showing that under the rectangle substitutes condition, if a protocol ends with Alice and Bob in approximate agreement, then Alice and Bob are approximately accurate. The first result we state assumes that $G$ is symmetric, but is otherwise quite general.

Theorem 4.14.

Assume that $G$ is symmetric about the line $x=\frac{1}{2}$. For any communication protocol that causes Alice and Bob to $\epsilon$-agree on $\mathcal{I}$, and for any $\beta\geq\frac{2}{c}\epsilon$, Alice and Bob are

\[
\left(\frac{8}{c^{2}}\beta+16\left(G(0)-G\left(\left(\frac{\epsilon}{\beta}\right)^{1/(1-\log_{2}c)}\right)\right)\right)\text{-accurate}
\]

after the protocol terminates.

This result is not our most general, as it assumes that $G$ is symmetric, but this assumption likely holds for most use cases. To apply the result optimally, one must first optimize $\beta$ as a function of $G$. For example, setting $\beta=\epsilon^{r/(r+1-\log_{2}c)}$ (with $r$ defined below) gives us the following corollary (Corollary 4.15 as stated, without the symmetry assumption, is actually a corollary of Theorem 4.18):

Corollary 4.15.

Assume that $G(0)-G(x),G(1)-G(1-x)\leq O(x^{r})$. For any communication protocol that causes Alice and Bob to $\epsilon$-agree on $\mathcal{I}$, Alice and Bob are $O\left(\epsilon^{r/(r+1-\log_{2}c)}\right)$-accurate after the protocol terminates, where the constant hidden by $O(\cdot)$ depends on $G$.

Remark 4.16.

Concretely, if $G^{\prime}$ is bounded then we can choose $r=1$, in which case our bound simplifies to $O\left(\epsilon^{1/(2-\log_{2}c)}\right)$. If instead we assume that $c=\frac{1}{2}$ (as is the case if $\sqrt{\text{JB}_{G}(\cdot,\cdot)}$ is a metric), then the bound is $O\left(\epsilon^{r/(r+2)}\right)$. If both of these are true, as is the case for $G(x)=x^{2}$, then the bound is $O(\epsilon^{1/3})$, which recovers our result in Theorem 3.1.

For $G$ equal to the negative of Shannon entropy (i.e. the $G$ for which $D_{G}$ is KL divergence), setting $\beta=\epsilon^{1/3}(\log 1/\epsilon)^{2/3}$ in Theorem 4.14 gives us the following corollary.

Corollary 4.17.

If $G(x)=x\log x+(1-x)\log(1-x)$, then for any communication protocol that causes Alice and Bob to $\epsilon$-agree on $\mathcal{I}$, Alice and Bob are $O(\epsilon^{1/3}(\log 1/\epsilon)^{2/3})$-accurate after the protocol terminates.

Theorem 4.14 follows from our most general result about agreement implying accuracy:

Theorem 4.18.

Let $\tilde{G}(x):=\max_{a,b:\left\lvert a-b\right\rvert\leq x}(G(a)-G(b))$ be the maximum possible difference in $G$-values of two points that differ by at most $x$, and let $\tilde{G}^{*}(x)$ be the concave envelope of $\tilde{G}$, i.e.

\[
\tilde{G}^{*}(x):=\max_{0\leq a,b,w\leq 1:\,wa+(1-w)b=x}w\tilde{G}(a)+(1-w)\tilde{G}(b).
\]

For any communication protocol that causes Alice and Bob to $\epsilon$-agree on $\mathcal{I}$, and for any $\beta>0$, Alice and Bob are

\[
\left(\frac{8}{c^{2}}\beta+16\tilde{G}^{*}\left(\left(\frac{\epsilon}{\beta}\right)^{1/(1-\log_{2}c)}\right)\right)\text{-accurate}
\]

after the protocol terminates.

Proof.

To prove Theorem 4.18, it suffices to prove the following lemma.

Lemma 4.19.

Let $G$ be a differentiable convex function on $[0,1]$ and $c\in(0,1)$ be such that $\text{JB}_{G}$ satisfies the $c$-approximate triangle inequality. Let $\mathcal{I}=(\Omega,\mathbb{P},\mathcal{S},\mathcal{T},Y)$ be an information structure that satisfies rectangle substitutes with respect to $G$. Let $\epsilon=\mathbb{E}\left[\text{JB}_{G}(\mu_{\sigma},\mu_{\tau})\right]$. Then for any $\beta>0$, we have

\[
\mathbb{E}\left[D_{G}(\mu_{\sigma\tau}\parallel\mu_{\tau})\right]\leq\frac{8}{c^{2}}\beta+16\tilde{G}^{*}\left(\left(\frac{\epsilon}{\beta}\right)^{1/(1-\log_{2}c)}\right).
\]

Let us first prove Theorem 4.18 assuming Lemma 4.19 is true.

Consider any protocol that causes Alice and Bob to $\epsilon$-agree on $\mathcal{I}$. Let $S$ be the set of possible signals of Alice at the end of the protocol which are consistent with the protocol transcript, and define $T$ likewise for Bob.

Let $\epsilon_{ST}=\mathbb{E}\left[\text{JB}_{G}(\mu_{\sigma T},\mu_{S\tau})\mid S,T\right]$. Note that

\[
\mathbb{E}_{S,T}\left[\epsilon_{ST}\right]=\mathbb{E}_{S,T}\left[\mathbb{E}\left[\text{JB}_{G}(\mu_{\sigma T},\mu_{S\tau})\mid S,T\right]\right]=\mathbb{E}\left[\text{JB}_{G}(\mu_{\sigma T},\mu_{S\tau})\right]\leq\epsilon.
\]

Therefore, for any $\beta>0$ we have

\begin{align*}
\mathbb{E}\left[D_{G}(\mu_{\sigma\tau}\parallel\mu_{S\tau})\right] &\leq\mathbb{E}_{S,T}\left[\frac{8}{c^{2}}\beta+16\tilde{G}^{*}\left(\left(\frac{\epsilon_{ST}}{\beta}\right)^{1/(1-\log_{2}c)}\right)\right]\\
&\leq\frac{8}{c^{2}}\beta+16\tilde{G}^{*}\left(\mathbb{E}_{S,T}\left[\left(\frac{\epsilon_{ST}}{\beta}\right)^{1/(1-\log_{2}c)}\right]\right)\\
&\leq\frac{8}{c^{2}}\beta+16\tilde{G}^{*}\left(\left(\frac{\mathbb{E}_{S,T}\left[\epsilon_{ST}\right]}{\beta}\right)^{1/(1-\log_{2}c)}\right)\\
&\leq\frac{8}{c^{2}}\beta+16\tilde{G}^{*}\left(\left(\frac{\epsilon}{\beta}\right)^{1/(1-\log_{2}c)}\right).
\end{align*}

In the first step, we apply Lemma 4.19 to the information structure $\mathcal{I}$ restricted to $S\times T$—that is, to $\mathcal{I}^{\prime}=(\Omega^{\prime},\mathbb{P}^{\prime},S,T,Y)$, where $\Omega^{\prime}=\{\omega\in\Omega:\sigma\in S,\tau\in T\}$ and $\mathbb{P}^{\prime}[\omega]=\mathbb{P}\left[\omega\mid\sigma\in S,\tau\in T\right]$. The next two steps follow by the concavity of $\tilde{G}^{*}$ and of $x^{1/(1-\log_{2}c)}$, respectively. ∎

The basic outline of the proof of Lemma 4.19 is similar to that of Lemma 3.2. Once again, we partition $[0,1]$ into $N$ intervals. Analogously to Equation 3, and with $S^{(k(\sigma))}$ defined analogously, we find that

\[
\mathbb{E}\left[D_{G}(\mu_{\sigma\tau}\parallel\mu_{\tau})\right]\leq\mathbb{E}\left[D_{G}(\mu_{\sigma}\parallel\mu_{S^{(k(\sigma))}})\right]+\mathbb{E}\left[D_{G}(\mu_{S^{(k(\sigma))}\tau}\parallel\mu_{\tau})\right].
\]

As before, we wish to upper bound each summand. However, the fact that the Bregman divergence is now arbitrary introduces complications. First, it is no longer the case that we can directly relate the length of an interval to the Bregman divergence between its endpoints. Second, we consider functions $G$ that become infinitely steep near $0$ and $1$ (such as the negative of Shannon entropy), which makes matters more challenging. This means that we need to be more careful when partitioning $[0,1]$ into $N$ intervals: see Algorithm C.3 for our new approach. Additionally, bounding the second summand involves reasoning carefully about the behavior of the function $G$, which is responsible for the introduction of $\tilde{G}^{*}$ into the lemma statement. We defer the full proof of Lemma 4.19 to Appendix C.

5 Connections to Markets

In this work, we established a natural condition on information structures, rectangle substitutes, under which any agreement protocol results in accurate beliefs. As we saw, a particularly natural class of agreement protocols are expectation-sharing protocols, where Alice and Bob take turns stating their current expected value, or discretizations thereof.

Expectation-sharing protocols have close connections to financial markets. In markets, the actions of traders reveal partial information about their believed value for an asset, i.e., their expectation. Specifically, a trader’s decision about whether to buy or sell, and how much, can be viewed as revealing a discretization of this expectation. In many theoretical models of markets (see e.g. [Ost12]) traders eventually reach agreement. The intuition behind this phenomenon is that a trader who disagrees with the price leaves money on the table by refusing to trade. Our work thus provides a lens into a well-studied question:111111This is related to the efficient market hypothesis, the problem of when market prices reflect all available information, which traces back at least to [Fam70] and [Hay45]. Modern models of financial markets are often based on [Kyl85]; we refer the reader to e.g. [Ost12] and references therein for further information. when are market prices accurate?

An important caveat, however, is that traders behave strategically, and may not disclose their true expected value. For example, a trader may choose to withhold information until a later point when doing so would be more profitable. Therefore, to interpret the actions of traders as revealing discretized versions of their expected value, one first has to understand the Bayes-Nash equilibria of the market. [CW16] studies conditions under which traders are incentivized to reveal all of their information on their first trading opportunity. They call a market equilibrium all-rush if every trader is incentivized to reveal their information immediately. Their main result, roughly speaking, is that there is an all-rush equilibrium if and only if the information structure satisfies strong substitutes—a different strengthening of the weak substitutes condition. This result is specific to settings in which traders have the option to reveal all of their information on their turn—a setting that would be considered trivial from the standpoint of communication theory.

An exciting question for further study is therefore: under what information structure conditions and market settings is it a Bayes-Nash equilibrium to follow an agreement protocol leading to accurate beliefs? In other words, what conditions give not only that agreement implies accuracy, but also that the market incentivizes participants to follow the protocol? Together with [CW16], our work suggests that certain substitutes-like conditions could suffice.

References

  • [Aar05] Scott Aaronson “The complexity of agreement” In Proceedings of the 37th Annual ACM Symposium on Theory of Computing, Baltimore, MD, USA, May 22-24, 2005 ACM, 2005, pp. 634–643 DOI: 10.1145/1060590.1060686
  • [ABB13] Sreangsu Acharyya, Arindam Banerjee and Daniel Boley “Bregman Divergences and Triangle Inequality” In Proceedings of the 13th SIAM International Conference on Data Mining, May 2-4, 2013. Austin, Texas, USA SIAM, 2013, pp. 476–484 DOI: 10.1137/1.9781611972832.53
  • [Aum76] Robert J. Aumann “Agreeing to Disagree” In The Annals of Statistics 4.6 Institute of Mathematical Statistics, 1976, pp. 1236–1239 URL: http://www.jstor.org/stable/2958591
  • [Ban+05] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon and Joydeep Ghosh “Clustering with Bregman divergences” In Journal of Machine Learning Research 6, 2005, pp. 1705–1749
  • [CCR08] P. Chen, Y. Chen and M. Rao “Metrics defined by Bregman Divergences” In Communications in Mathematical Sciences 6.4 International Press of Boston, 2008, pp. 915–926 DOI: cms/1229619676
  • [CW16] Yiling Chen and Bo Waggoner “Informational Substitutes” In IEEE 57th Annual Symposium on Foundations of Computer Science, FOCS 2016, 9-11 October 2016, Hyatt Regency, New Brunswick, New Jersey, USA IEEE Computer Society, 2016, pp. 239–247 DOI: 10.1109/FOCS.2016.33
  • [ES03] Dominik Maria Endres and Johannes E. Schindelin “A new metric for probability distributions” In IEEE Trans. Inf. Theory 49.7, 2003, pp. 1858–1860 DOI: 10.1109/TIT.2003.813506
  • [Fam70] Eugene F. Fama “Efficient capital markets: A review of theory and empirical work” In The Journal of Finance 25.2 Blackwell Publishing for the American Finance Association, 1970, pp. 383–417
  • [Hay45] Friedrich August Hayek “The use of knowledge in society” In The American economic review 35.4 JSTOR, 1945, pp. 519–530
  • [Kyl85] Albert S. Kyle “Continuous auctions and insider trading” In Econometrica: Journal of the Econometric Society 53.6, 1985, pp. 1315–1335
  • [NR21] Eric Neyman and Tim Roughgarden “Are You Smarter Than a Random Expert? The Robust Aggregation of Substitutable Signals”, 2021
  • [NR21a] Eric Neyman and Tim Roughgarden “From Proper Scoring Rules to Max-Min Optimal Forecast Aggregation” In EC ’21: The 22nd ACM Conference on Economics and Computation, Budapest, Hungary, July 18-23, 2021 ACM, 2021, pp. 734 DOI: 10.1145/3465456.3467599
  • [Ost12] Michael Ostrovsky “Information aggregation in dynamic markets with strategic traders” In Econometrica 80.6 Wiley Online Library, 2012, pp. 2595–2647
  • [RY20] Anup Rao and Amir Yehudayoff “Communication Complexity: and Applications” Cambridge University Press, 2020
  • [Šid57] Z. Šidák “On Relations Between Strict-Sense and Wide-Sense Conditional Expectations” In Theory of Probability and Its Applications 2, 1957, pp. 267–272

Appendix A Details Omitted From Section 2

Above we stated that the weak substitutes condition is not sufficient for agreement to imply accuracy. To see this, consider the following information structure:

  • Nature flips a coin. Alice’s and Bob’s signals are each a pair consisting of the outcome of the coin flip and an additional bit:

  • If the coin lands heads, Alice and Bob are given highly correlated bits as part of their signals: the bits are (0,0)(0,0) and (1,1)(1,1) with probability 45% each and (1,0)(1,0) and (0,1)(0,1) with probability 5% each.

  • If the coin lands tails, Alice and Bob are given highly anticorrelated bits as part of their signals: the bits are (0,0)(0,0) and (1,1)(1,1) with probability 5% each and (1,0)(1,0) and (0,1)(0,1) with probability 45% each.

The value YY is the XOR of Alice’s and Bob’s bits. It can be verified that this information structure satisfies weak substitutes; the intuition is that a majority of the value comes from knowing the outcome of the coin flip, which can be inferred from Alice’s (or Bob’s) signal alone.

Alice and Bob 0-agree at the very start. However, they are not 0-accurate, since their expectations are either 10% or 90%, and the right answer is either 0 or 1. Therefore, weak substitutes alone is insufficient for agreement to imply accuracy.
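
To make the example concrete, the following small Python sketch (the probabilities are exactly those listed above; the helper names are ours) enumerates the eight outcomes, computes Alice's and Bob's conditional expectations of Y, and confirms that the two always coincide while the pooled-information answer does not.

```python
from itertools import product

# Outcomes are triples (coin, a, b): a is Alice's extra bit, b is Bob's, and Y = a XOR b.
prob = {}
for coin, a, b in product(["H", "T"], [0, 1], [0, 1]):
    same = 0.45 if coin == "H" else 0.05   # probability of each agreeing pair, given the coin
    diff = 0.05 if coin == "H" else 0.45   # probability of each disagreeing pair, given the coin
    prob[(coin, a, b)] = 0.5 * (same if a == b else diff)

def cond_exp(keep):
    """E[Y | event], where the event consists of the outcomes o with keep(o) True."""
    total = sum(p for o, p in prob.items() if keep(o))
    return sum(p * (o[1] ^ o[2]) for o, p in prob.items() if keep(o)) / total

for (coin, a, b) in prob:
    alice = cond_exp(lambda o: o[0] == coin and o[1] == a)  # Alice's signal is (coin, a)
    bob   = cond_exp(lambda o: o[0] == coin and o[2] == b)  # Bob's signal is (coin, b)
    print(coin, (a, b), f"Alice: {alice:.2f}", f"Bob: {bob:.2f}", "truth:", a ^ b)

# Output: Alice's and Bob's expectations are identical in every outcome (0.10 on heads,
# 0.90 on tails), so they 0-agree from the start, yet Y itself is always 0 or 1.
```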

Appendix B Details Omitted From Section 3

Proof of Claim 3.3.

We claim that in fact we can choose the xix_{i}’s so that each xix_{i} is in [iN12N,iN+12N]\left[\frac{i}{N}-\frac{1}{2N},\frac{i}{N}+\frac{1}{2N}\right]. This ensures that each interval has length at most 2N\frac{2}{N}.

For x[0,1]x\in[0,1], let ρ(x)\rho(x) be the probability that xx is between μσ\mu_{\sigma} and μτ\mu_{\tau}, inclusive. Note that [k(σ)k(τ)]i=1N1ρ(xi)\mathbb{P}\left[k(\sigma)\neq k(\tau)\right]\leq\sum_{i=1}^{N-1}\rho(x_{i}).

Observe that if xx is selected uniformly from [0,1][0,1], the expected value of ρ(x)\rho(x) is equal to |μσμτ|\left\lvert\mu_{\sigma}-\mu_{\tau}\right\rvert, because both quantities are equal to the probability that xx is between μσ\mu_{\sigma} and μτ\mu_{\tau}. Therefore, if (σ,τ)(\sigma,\tau) is additionally chosen according to \mathbb{P}, we have

𝔼x[0,1][ρ(x)]=𝔼[|μσμτ|]𝔼[(μσμτ)2]=ϵ.\mathbb{E}_{x\leftarrow[0,1]}\left[\rho(x)\right]=\mathbb{E}\left[\left\lvert\mu_{\sigma}-\mu_{\tau}\right\rvert\right]\leq\sqrt{\mathbb{E}\left[(\mu_{\sigma}-\mu_{\tau})^{2}\right]}=\sqrt{\epsilon}.

This means that

i=1N1𝔼x[iN12N,iN+12N][ρ(x)]=(N1)𝔼x[12N,112N][ρ(x)]ϵN.\sum_{i=1}^{N-1}\mathbb{E}_{x\leftarrow\left[\frac{i}{N}-\frac{1}{2N},\frac{i}{N}+\frac{1}{2N}\right]}\left[\rho(x)\right]=(N-1)\mathbb{E}_{x\leftarrow\left[\frac{1}{2N},1-\frac{1}{2N}\right]}\left[\rho(x)\right]\leq\sqrt{\epsilon}N.

Thus, if each xix_{i} is selected uniformly at random from [iN12N,iN+12N]\left[\frac{i}{N}-\frac{1}{2N},\frac{i}{N}+\frac{1}{2N}\right], the expected value of [k(σ)k(τ)]\mathbb{P}\left[k(\sigma)\neq k(\tau)\right] would be at most ϵN\sqrt{\epsilon}N. In particular this means that there exist choices of the xix_{i}’s such that [k(σ)k(τ)]ϵN\mathbb{P}\left[k(\sigma)\neq k(\tau)\right]\leq\sqrt{\epsilon}N. ∎
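
In the finite-signal case, this averaging argument can be made constructive. The sketch below assumes the joint distribution of Alice's and Bob's estimates is given as a list of (probability, μσ, μτ) triples (a hypothetical input format); it samples each xi from its window and keeps the best draw, with random search standing in for the existence argument above, which guarantees a choice achieving P[k(σ)≠k(τ)] ≤ √ε·N.

```python
import bisect
import numpy as np

def mismatch_prob(cuts, pairs):
    """P[k(sigma) != k(tau)]: the two estimates land in different cells of the partition."""
    return sum(p for p, a, b in pairs
               if bisect.bisect_right(cuts, a) != bisect.bisect_right(cuts, b))

def choose_cuts(pairs, N, trials=500, seed=0):
    """Sample each x_i uniformly from [i/N - 1/(2N), i/N + 1/(2N)] and keep the best draw.
    The averaging argument in Claim 3.3 shows the minimum is at most sqrt(eps) * N."""
    rng = np.random.default_rng(seed)
    best_cuts, best_val = None, np.inf
    for _ in range(trials):
        cuts = [(i + rng.uniform(-0.5, 0.5)) / N for i in range(1, N)]
        val = mismatch_prob(cuts, pairs)
        if val < best_val:
            best_cuts, best_val = cuts, val
    return best_cuts, best_val
```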

Proof of Lemma 3.2.

Fix a large positive integer NN (we will later find it optimal to set N=ϵ1/6N=\epsilon^{-1/6}). Consider a partition of [0,1][0,1] into NN intervals [0,x1),[x1,x2),,[xN1,1][0,x_{1}),[x_{1},x_{2}),\dots,[x_{N-1},1] satisfying the conditions of Claim 3.3. Let S(k):={σ𝒮:xk1μσ<xk}S^{(k)}:=\{\sigma\in\mathcal{S}:x_{k-1}\leq\mu_{\sigma}<x_{k}\}. Additionally, let k(σ)k(\sigma) and k(τ)k(\tau) be as defined in Claim 3.3.

Our goal is to upper bound the expectation of (μστμτ)2(\mu_{\sigma\tau}-\mu_{\tau})^{2}. In pursuit of this goal, we observe that by the Pythagorean theorem, we have

𝔼[(μστμτ)2]=𝔼[(μστμS(k(σ))τ)2]+𝔼[(μS(k(σ))τμτ)2].\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{\tau})^{2}\right]=\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S^{(k(\sigma))}\tau})^{2}\right]+\mathbb{E}\left[(\mu_{S^{(k(\sigma))}\tau}-\mu_{\tau})^{2}\right].

We now use the rectangle substitutes assumption: for any kk, by applying Equation 1 to S=S(k)S=S^{(k)} and T=𝒯T=\mathcal{T}, we know that

𝔼[(μσμS(k))2σS(k)]𝔼[(μστμS(k)τ)2σS(k)].\mathbb{E}\left[(\mu_{\sigma}-\mu_{S^{(k)}})^{2}\mid\sigma\in S^{(k)}\right]\geq\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S^{(k)}\tau})^{2}\mid\sigma\in S^{(k)}\right].

Taking the expectation over kk (i.e. choosing each kk with probability equal to [σS(k)]\mathbb{P}\left[\sigma\in S^{(k)}\right]), we have that

𝔼[(μσμS(k(σ)))2]𝔼[(μστμS(k(σ))τ)2].\mathbb{E}\left[(\mu_{\sigma}-\mu_{S^{(k(\sigma))}})^{2}\right]\geq\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S^{(k(\sigma))}\tau})^{2}\right]. (6)

Therefore, we have

𝔼[(μστμτ)2]𝔼[(μσμS(k(σ)))2]+𝔼[(μS(k(σ))τμτ)2].\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{\tau})^{2}\right]\leq\mathbb{E}\left[(\mu_{\sigma}-\mu_{S^{(k(\sigma))}})^{2}\right]+\mathbb{E}\left[(\mu_{S^{(k(\sigma))}\tau}-\mu_{\tau})^{2}\right]. (7)

We will use Claim 3.3 to argue that each of these two summands is small. The argument regarding the first summand is straightforward: for any σ\sigma, we have that xk(σ)1μσ,μS(k(σ))<xk(σ)xk(σ)1+2Nx_{k(\sigma)-1}\leq\mu_{\sigma},\mu_{S^{(k(\sigma))}}<x_{k(\sigma)}\leq x_{k(\sigma)-1}+\frac{2}{N}, which means that 𝔼[(μσμS(k(σ)))2]4N2\mathbb{E}\left[(\mu_{\sigma}-\mu_{S^{(k(\sigma))}})^{2}\right]\leq\frac{4}{N^{2}}.

We now upper bound the second summand.121212The proof below takes sums over τ^𝒯\hat{\tau}\in\mathcal{T} and thus implicitly assumes that 𝒯\mathcal{T} is finite, but the proof extends to infinite 𝒯\mathcal{T}, with sums over τ\tau replaced by integrals with respect to the probability measure over 𝒯\mathcal{T}. For any τ^𝒯\hat{\tau}\in\mathcal{T}, let p(τ^)=[τ=τ^]p(\hat{\tau})=\mathbb{P}\left[\tau=\hat{\tau}\right] and q(τ^)=[τ=τ^,k(σ)k(τ)]q(\hat{\tau})=\mathbb{P}\left[\tau=\hat{\tau},k(\sigma)\neq k(\tau)\right]. Then τ^𝒯p(τ^)=1\sum_{\hat{\tau}\in\mathcal{T}}p(\hat{\tau})=1 and τ^𝒯q(τ^)ϵN\sum_{\hat{\tau}\in\mathcal{T}}q(\hat{\tau})\leq\sqrt{\epsilon}N. Observe that

𝔼[(μS(k(σ))τμτ)2]\displaystyle\mathbb{E}\left[(\mu_{S^{(k(\sigma))}\tau}-\mu_{\tau})^{2}\right] =τ^p(τ^)𝔼[(μS(k(σ))τ^μτ^)2τ=τ^]\displaystyle=\sum_{\hat{\tau}}p(\hat{\tau})\mathbb{E}\left[(\mu_{S^{(k(\sigma))}\hat{\tau}}-\mu_{\hat{\tau}})^{2}\mid\tau=\hat{\tau}\right]
=τ^(p(τ^)q(τ^))𝔼[(μS(k(σ))τ^μτ^)2τ=τ^,k(σ)=k(τ^)]\displaystyle=\sum_{\hat{\tau}}(p(\hat{\tau})-q(\hat{\tau}))\mathbb{E}\left[(\mu_{S^{(k(\sigma))}\hat{\tau}}-\mu_{\hat{\tau}})^{2}\mid\tau=\hat{\tau},k(\sigma)=k(\hat{\tau})\right]
+q(τ^)𝔼[(μS(k(σ))τ^μτ^)2τ=τ^,k(σ)k(τ^)].\displaystyle\qquad+q(\hat{\tau})\mathbb{E}\left[(\mu_{S^{(k(\sigma))}\hat{\tau}}-\mu_{\hat{\tau}})^{2}\mid\tau=\hat{\tau},k(\sigma)\neq k(\hat{\tau})\right]. (8)

To handle the first expectation, we note that if k(σ)=k(τ)k(\sigma)=k(\tau), then |μS(k(σ))τ^μτ^|q(τ^)p(τ^)\left\lvert\mu_{S^{(k(\sigma))}\hat{\tau}}-\mu_{\hat{\tau}}\right\rvert\leq\frac{q(\hat{\tau})}{p(\hat{\tau})}. To see this, observe

p(τ^)μτ^=(p(τ^)q(τ^))μS(k(σ))τ^+q(τ^)μ𝒮S(k(σ))τ^.p(\hat{\tau})\mu_{\hat{\tau}}=(p(\hat{\tau})-q(\hat{\tau}))\mu_{S^{(k(\sigma))}\hat{\tau}}+q(\hat{\tau})\mu_{\mathcal{S}\setminus S^{(k(\sigma))}\hat{\tau}}~{}.

Rearranging and taking absolute values, we conclude

p(τ^)|μS(k(σ))τ^μτ^|=q(τ^)|μS(k(σ))τ^μ𝒮S(k(σ))τ^|q(τ^).p(\hat{\tau})\left\lvert\mu_{S^{(k(\sigma))}\hat{\tau}}-\mu_{\hat{\tau}}\right\rvert=q(\hat{\tau})\left\lvert\mu_{S^{(k(\sigma))}\hat{\tau}}-\mu_{\mathcal{S}\setminus S^{(k(\sigma))}\hat{\tau}}\right\rvert\leq q(\hat{\tau}).

Therefore, recalling q(τ^)p(τ^)q(\hat{\tau})\leq p(\hat{\tau}), we have

(p(τ^)q(τ^))𝔼[(μS(k(σ))τ^μτ^)2τ=τ^,k(σ)=k(τ^)](p(τ^)q(τ^))(q(τ^)p(τ^))2q(τ^)2p(τ^)q(τ^).\displaystyle(p(\hat{\tau})-q(\hat{\tau}))\mathbb{E}\left[(\mu_{S^{(k(\sigma))}\hat{\tau}}-\mu_{\hat{\tau}})^{2}\mid\tau=\hat{\tau},k(\sigma)=k(\hat{\tau})\right]\leq(p(\hat{\tau})-q(\hat{\tau}))\left(\frac{q(\hat{\tau})}{p(\hat{\tau})}\right)^{2}\leq\frac{q(\hat{\tau})^{2}}{p(\hat{\tau})}\leq q(\hat{\tau}).

On the other hand, we can bound the second expectation in Equation 8 by 11. Therefore we have

𝔼[(μS(k(σ))τμτ)2]τ^(q(τ^)+q(τ^))=2τ^q(τ^)2ϵN.\mathbb{E}\left[(\mu_{S^{(k(\sigma))}\tau}-\mu_{\tau})^{2}\right]\leq\sum_{\hat{\tau}}(q(\hat{\tau})+q(\hat{\tau}))=2\sum_{\hat{\tau}}q(\hat{\tau})\leq 2\sqrt{\epsilon}N.

The last inequality above is by Claim 3.3. To conclude, we now know that

𝔼[(μστμτ)2]4N2+2ϵN.\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{\tau})^{2}\right]\leq\frac{4}{N^{2}}+2\sqrt{\epsilon}N.

Setting N=ϵ1/6N=\epsilon^{-1/6} makes the right-hand side equal to 6ϵ1/36\epsilon^{1/3}, completing the proof. ∎

Proposition B.1.

Consider the following protocol, parametrized by ϵ>0\epsilon>0. Alice and Bob send their initial expectations to each other, rounding to the nearest multiple of ϵ\epsilon. This protocol entails communicating O(log1/ϵ)O(\log 1/\epsilon) bits. At the end of the protocol, Alice and Bob 2ϵ22\epsilon^{2}-agree and are ϵ2\epsilon^{2}-accurate (with respect to G(x)=x2G(x)=x^{2}).

Proof.

Let SS be the set of possible signals of Alice at the end of the protocol which are consistent with the protocol transcript, and define TT likewise for Bob. Recall that we use 𝒮\mathcal{S} and 𝒯\mathcal{T} to denote the sets of all of Alice’s and Bob’s possible signals, respectively. We have

𝔼[(μστμSτ)2]𝔼[(μσμS)2]ϵ2,\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S\tau})^{2}\right]\leq\mathbb{E}\left[(\mu_{\sigma}-\mu_{S})^{2}\right]\leq\epsilon^{2},

since μσ\mu_{\sigma} and μS\mu_{S} are guaranteed to be within ϵ\epsilon of each other by construction. Thus, Bob is ϵ2\epsilon^{2}-accurate, and likewise for Alice. By the 12\frac{1}{2}-approximate triangle inequality for G(x)=x2G(x)=x^{2}, it follows that Alice and Bob 2ϵ22\epsilon^{2}-agree. ∎
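
For reference, the protocol of Proposition B.1 amounts to a single exchange of rounded expectations; a minimal sketch follows (the helper names are ours, and the accuracy analysis above is of course not simulated here). After the exchange, Bob's posterior μSτ conditions on the announced bucket of Alice's expectation, which is exactly what the proof compares against.

```python
import math

def round_to_grid(mu, eps):
    """Round an expectation in [0, 1] to the nearest multiple of eps."""
    return min(1.0, round(mu / eps) * eps)

def one_round_protocol(mu_alice, mu_bob, eps):
    """Each party announces its rounded expectation; there are roughly 1/eps + 1 possible
    messages, so each message costs O(log(1/eps)) bits."""
    bits_per_message = math.ceil(math.log2(1 / eps + 1))
    return round_to_grid(mu_alice, eps), round_to_grid(mu_bob, eps), bits_per_message
```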

Appendix C Details Omitted From Section 4

Proof of Proposition 4.8.

Let g=Gg=G^{\prime}. We have

𝔼[DG(AB)]+𝔼[DG(BC)]𝔼[DG(AC)]\displaystyle\mathbb{E}\left[D_{G}(A\parallel B)\right]+\mathbb{E}\left[D_{G}(B\parallel C)\right]-\mathbb{E}\left[D_{G}(A\parallel C)\right]
=𝔼[G(A)G(B)(AB)g(B)+G(B)G(C)(BC)g(C)G(A)+G(C)+(AC)g(C)]\displaystyle=\mathbb{E}\left[G(A)-G(B)-(A-B)g(B)+G(B)-G(C)-(B-C)g(C)-G(A)+G(C)+(A-C)g(C)\right]
=𝔼[(AB)(g(C)g(B))]=𝔼[𝔼[(AB)(g(C)g(B))]]\displaystyle=\mathbb{E}\left[(A-B)(g(C)-g(B))\right]=\mathbb{E}\left[\mathbb{E}\left[(A-B)(g(C)-g(B))\mid\mathcal{F}\right]\right]
=𝔼[(g(C)g(B))𝔼[AB]]=𝔼[(g(C)g(B))(𝔼[A]B)]=0.\displaystyle=\mathbb{E}\left[(g(C)-g(B))\mathbb{E}\left[A-B\mid\mathcal{F}\right]\right]=\mathbb{E}\left[(g(C)-g(B))(\mathbb{E}\left[A\mid\mathcal{F}\right]-B)\right]=0.

The third-to-last step follows from the fact that g(C)g(B)g(C)-g(B) is \mathcal{F}-measurable (we are using the “pulling out known factors” property of conditional expectation). The last step follows from the fact that 𝔼[A]=B\mathbb{E}\left[A\mid\mathcal{F}\right]=B. ∎
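
Since Proposition 4.8 is an exact identity, it is easy to sanity-check numerically. The sketch below uses a made-up 3×3 table of values μ_st in (0,1) with a uniform prior and takes G to be the negative Shannon entropy; here A is the most informed estimate, B = E[A | s] (so the σ-algebra F is generated by s), and C = E[A] is constant, hence F-measurable.

```python
import numpy as np

rng = np.random.default_rng(0)

def G(x):                      # negative Shannon entropy (a convex function on (0,1))
    return x * np.log(x) + (1 - x) * np.log(1 - x)

def bregman(p, q):             # D_G(p || q); for this G it equals the binary KL divergence
    return G(p) - G(q) - (p - q) * (np.log(q) - np.log(1 - q))

# Toy structure: s and t each take 3 values, uniformly and independently, and
# mu_st = E[Y | s, t] is an arbitrary table of probabilities (made up for the test).
P = np.full((3, 3), 1 / 9)
mu_st = rng.uniform(0.05, 0.95, size=(3, 3))

A = mu_st                                      # knows (s, t)
B = (P * A).sum(axis=1) / P.sum(axis=1)        # E[A | s]; F = sigma(s)
C = (P * A).sum()                              # E[A]; a constant, hence F-measurable

lhs = (P * bregman(A, B[:, None])).sum() + (P * bregman(B[:, None], C)).sum()
rhs = (P * bregman(A, C)).sum()
print(lhs, rhs)    # equal up to floating-point error, as Proposition 4.8 predicts
```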

Proposition C.1.

Let GG be a differentiable convex function on the interval [0,1][0,1]. For all 0ab10\leq a\leq b\leq 1, we have

  1. (i)

    12(DG(ax)+DG(bx))JBG(a,b)\frac{1}{2}(D_{G}(a\parallel x)+D_{G}(b\parallel x))\geq\text{JB}_{G}(a,b) for every x[0,1]x\in[0,1].

  2. (ii)

    JBG\text{JB}_{G} satisfies the reverse triangle inequality: for every x[a,b]x\in[a,b], we have JBG(a,x)+JBG(x,b)JBG(a,b)\text{JB}_{G}(a,x)+\text{JB}_{G}(x,b)\leq\text{JB}_{G}(a,b).

  3. (iii)

    For all aabba\leq a^{\prime}\leq b^{\prime}\leq b, we have JBG(a,b)JBG(a,b)\text{JB}_{G}(a^{\prime},b^{\prime})\leq\text{JB}_{G}(a,b).

  4. (iv)

    For a random variable XX supported on [a,b][a,b], we have

    𝔼[DG(X𝔼[X])]=𝔼[G(X)]G(𝔼[X])2JBG(a,b).\mathbb{E}\left[D_{G}(X\parallel\mathbb{E}\left[X\right])\right]=\mathbb{E}\left[G(X)\right]-G(\mathbb{E}\left[X\right])\leq 2\text{JB}_{G}(a,b).
Proof.

Fact (i) follows from Proposition 4.2. Regarding Fact (ii), without loss of generality assume that xa+b2x\leq\frac{a+b}{2} and that G(x)=G(a+b2)G(x)=G\left(\frac{a+b}{2}\right) (uniformly adding a constant to the derivative of GG does not change any Jensen-Bregman divergence, hence the second assumption). Then G(a+x2)G(x)G\left(\frac{a+x}{2}\right)\geq G(x), so JBG(a,x)G(a)G(x)2\text{JB}_{G}(a,x)\leq\frac{G(a)-G(x)}{2}. Since b+x2a+b2\frac{b+x}{2}\geq\frac{a+b}{2}, we also have that G(b+x2)G(x)G\left(\frac{b+x}{2}\right)\geq G(x), so JBG(b,x)G(b)G(x)2\text{JB}_{G}(b,x)\leq\frac{G(b)-G(x)}{2}. Thus, we have

JBG(a,x)+JBG(b,x)G(a)+G(b)2G(x)=G(a)+G(b)2G(a+b2)=JBG(a,b).\text{JB}_{G}(a,x)+\text{JB}_{G}(b,x)\leq\frac{G(a)+G(b)}{2}-G(x)=\frac{G(a)+G(b)}{2}-G\left(\frac{a+b}{2}\right)=\text{JB}_{G}(a,b).

Fact (iii) follows from Fact (ii): we have

JBG(a,b)=JBG(a,a)+JBG(a,b)+JBG(b,b)JBG(a,b).\text{JB}_{G}(a,b)=\text{JB}_{G}(a,a^{\prime})+\text{JB}_{G}(a^{\prime},b^{\prime})+\text{JB}_{G}(b^{\prime},b)\geq\text{JB}_{G}(a^{\prime},b^{\prime}).

Regarding the equality in Fact (iv), we have

𝔼[DG(X𝔼[X])]\displaystyle\mathbb{E}\left[D_{G}(X\parallel\mathbb{E}\left[X\right])\right] =𝔼[G(X)G(𝔼[X])(X𝔼[X])G(𝔼[X])]\displaystyle=\mathbb{E}\left[G(X)-G(\mathbb{E}\left[X\right])-(X-\mathbb{E}\left[X\right])G^{\prime}(\mathbb{E}\left[X\right])\right]
=𝔼[G(X)G(𝔼[X])]=𝔼[G(X)]G(𝔼[X]),\displaystyle=\mathbb{E}\left[G(X)-G(\mathbb{E}\left[X\right])\right]=\mathbb{E}\left[G(X)\right]-G(\mathbb{E}\left[X\right]),

where the first step follows from the fact that 𝔼[(X𝔼[X])G(𝔼[X])]=G(𝔼[X])𝔼[X𝔼[X]]\mathbb{E}\left[(X-\mathbb{E}\left[X\right])G^{\prime}(\mathbb{E}\left[X\right])\right]=G^{\prime}(\mathbb{E}\left[X\right])\mathbb{E}\left[X-\mathbb{E}\left[X\right]\right], and 𝔼[X𝔼[X]]=0\mathbb{E}\left[X-\mathbb{E}\left[X\right]\right]=0.

Regarding the inequality in Fact (iv), without loss of generality assume that 𝔼[X]a+b2\mathbb{E}\left[X\right]\leq\frac{a+b}{2}. By convexity we have that

G(a+b2)ba2b𝔼[X]G(𝔼[X])+a+b2𝔼[X]b𝔼[X]G(b),G\left(\frac{a+b}{2}\right)\leq\frac{\frac{b-a}{2}}{b-\mathbb{E}\left[X\right]}G(\mathbb{E}\left[X\right])+\frac{\frac{a+b}{2}-\mathbb{E}\left[X\right]}{b-\mathbb{E}\left[X\right]}G(b),

so

JBG(a,b)\displaystyle\text{JB}_{G}(a,b) =G(a)+G(b)2G(a+b2)\displaystyle=G(a)+G(b)-2G\left(\frac{a+b}{2}\right)
G(a)+G(b)bab𝔼[X]G(𝔼[X])a+b2𝔼[X]b𝔼[X]G(b)\displaystyle\geq G(a)+G(b)-\frac{b-a}{b-\mathbb{E}\left[X\right]}G(\mathbb{E}\left[X\right])-\frac{a+b-2\mathbb{E}\left[X\right]}{b-\mathbb{E}\left[X\right]}G(b)
=G(a)+𝔼[X]ab𝔼[X]G(b)bab𝔼[X]G(𝔼[X])\displaystyle=G(a)+\frac{\mathbb{E}\left[X\right]-a}{b-\mathbb{E}\left[X\right]}G(b)-\frac{b-a}{b-\mathbb{E}\left[X\right]}G(\mathbb{E}\left[X\right])
=bab𝔼[X](b𝔼[X]baG(a)+𝔼[X]abaG(b)G(𝔼[X]))\displaystyle=\frac{b-a}{b-\mathbb{E}\left[X\right]}\left(\frac{b-\mathbb{E}\left[X\right]}{b-a}G(a)+\frac{\mathbb{E}\left[X\right]-a}{b-a}G(b)-G(\mathbb{E}\left[X\right])\right)
b𝔼[X]baG(a)+𝔼[X]abaG(b)G(𝔼[X])𝔼[G(X)]G(𝔼[X]).\displaystyle\geq\frac{b-\mathbb{E}\left[X\right]}{b-a}G(a)+\frac{\mathbb{E}\left[X\right]-a}{b-a}G(b)-G(\mathbb{E}\left[X\right])\geq\mathbb{E}\left[G(X)\right]-G(\mathbb{E}\left[X\right]).

In the last step we use the fact that for a convex function ff and a random variable XX defined on an interval [a,b][a,b] with mean μ\mu, the maximum possible value of 𝔼[f(X)]\mathbb{E}\left[f(X)\right] is attained if XX is either aa or bb with the appropriate probabilities. ∎
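
For a concrete instance of these facts, take G to be the negative Shannon entropy, so that D_G is the binary KL divergence and JB_G is the (binary) Jensen-Shannon divergence; the sketch below spot-checks facts (i)-(iv) at random points (the tolerances are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(1)

def G(x):              # negative Shannon entropy
    return x * np.log(x) + (1 - x) * np.log(1 - x)

def D(p, q):           # Bregman divergence D_G(p || q): the binary KL divergence
    return G(p) - G(q) - (p - q) * (np.log(q) - np.log(1 - q))

def JB(a, b):          # Jensen-Bregman divergence; here the binary Jensen-Shannon divergence
    return (G(a) + G(b)) / 2 - G((a + b) / 2)

for _ in range(1000):
    a, b = np.sort(rng.uniform(0.01, 0.99, size=2))
    x = rng.uniform(a, b)                                     # a point of [a, b]
    ap, bp = np.sort(rng.uniform(a, b, size=2))               # a sub-interval of [a, b]
    X = rng.uniform(a, b, size=50)                            # a random variable on [a, b]
    assert (D(a, x) + D(b, x)) / 2 >= JB(a, b) - 1e-12        # (i)
    assert JB(a, x) + JB(x, b) <= JB(a, b) + 1e-12            # (ii)  reverse triangle inequality
    assert JB(ap, bp) <= JB(a, b) + 1e-12                     # (iii) monotonicity
    assert G(X).mean() - G(X.mean()) <= 2 * JB(a, b) + 1e-12  # (iv)
```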

Proof of Theorem 4.11.

Suppose that Alice and Bob do not ϵ\epsilon-agree at time step tt, and without loss of generality assume that the next turn (number t+1t+1) is Alice’s. We begin by observing that, by Proposition C.1 (i), we have

𝔼[DG(μσTtμStTt)+DG(μStτμStTt)]2𝔼[JBG(μσTt,μStτ)]>2ϵ.\mathbb{E}\left[D_{G}(\mu_{\sigma T_{t}}\parallel\mu_{S_{t}T_{t}})+D_{G}(\mu_{S_{t}\tau}\parallel\mu_{S_{t}T_{t}})\right]\geq 2\mathbb{E}\left[\text{JB}_{G}(\mu_{\sigma T_{t}},\mu_{S_{t}\tau})\right]>2\epsilon.

Therefore, either 𝔼[DG(μσTtμStTt)]2ϵ3\mathbb{E}\left[D_{G}(\mu_{\sigma T_{t}}\parallel\mu_{S_{t}T_{t}})\right]\geq\frac{2\epsilon}{3} or 𝔼[DG(μStτμStTt)]4ϵ3\mathbb{E}\left[D_{G}(\mu_{S_{t}\tau}\parallel\mu_{S_{t}T_{t}})\right]\geq\frac{4\epsilon}{3}.

Case 1:

𝔼[DG(μσTtμStTt)]2ϵ3\mathbb{E}\left[D_{G}(\mu_{\sigma T_{t}}\parallel\mu_{S_{t}T_{t}})\right]\geq\frac{2\epsilon}{3}. Let us use “hi,” “lo,” and “md” to denote the events that Alice says “high,” Alice says “low,” and Alice says “medium,” respectively. We have

2ϵ3\displaystyle\frac{2\epsilon}{3} 𝔼[DG(μσTtμStTt)]=𝔼[𝔼[DG(μσTtμStTt)St,Tt]]\displaystyle\leq\mathbb{E}\left[D_{G}(\mu_{\sigma T_{t}}\parallel\mu_{S_{t}T_{t}})\right]=\mathbb{E}\left[\mathbb{E}\left[D_{G}(\mu_{\sigma T_{t}}\parallel\mu_{S_{t}T_{t}})\mid S_{t},T_{t}\right]\right]
=𝔼[𝔼[DG(μσTtμStTt)𝟙hi or loSt,Tt]]+𝔼[𝔼[DG(μσTtμStTt)𝟙mdSt,Tt]]\displaystyle=\mathbb{E}\left[\mathbb{E}\left[D_{G}(\mu_{\sigma T_{t}}\parallel\mu_{S_{t}T_{t}})\cdot\mathbbm{1}_{\text{hi or lo}}\mid S_{t},T_{t}\right]\right]+\mathbb{E}\left[\mathbb{E}\left[D_{G}(\mu_{\sigma T_{t}}\parallel\mu_{S_{t}T_{t}})\cdot\mathbbm{1}_{\text{md}}\mid S_{t},T_{t}\right]\right]
𝔼[𝔼[DG(μσTtμStTt)𝟙hi or loSt,Tt]]+ϵ2,\displaystyle\leq\mathbb{E}\left[\mathbb{E}\left[D_{G}(\mu_{\sigma T_{t}}\parallel\mu_{S_{t}T_{t}})\cdot\mathbbm{1}_{\text{hi or lo}}\mid S_{t},T_{t}\right]\right]+\frac{\epsilon}{2},

where “St,Tt\mid S_{t},T_{t}” is short for “σSt,τTt\mid\sigma\in S_{t},\tau\in T_{t},” a notation we use throughout the proof. We thus have

𝔼[𝔼[DG(μσTtμStTt)𝟙hiSt,Tt]]+𝔼[𝔼[DG(μσTtμStTt)𝟙loSt,Tt]]ϵ6.\mathbb{E}\left[\mathbb{E}\left[D_{G}(\mu_{\sigma T_{t}}\parallel\mu_{S_{t}T_{t}})\cdot\mathbbm{1}_{\text{hi}}\mid S_{t},T_{t}\right]\right]+\mathbb{E}\left[\mathbb{E}\left[D_{G}(\mu_{\sigma T_{t}}\parallel\mu_{S_{t}T_{t}})\cdot\mathbbm{1}_{\text{lo}}\mid S_{t},T_{t}\right]\right]\geq\frac{\epsilon}{6}. (9)

We now make use of the following lemma.

Lemma C.2.

Suppose that turn t+1t+1 is Alice’s. Let “hi” denote the event that Alice says “high.” Let α:=𝔼[DG(μσTtμStTt)𝟙hiSt,Tt]\alpha:=\mathbb{E}\left[D_{G}(\mu_{\sigma T_{t}}\parallel\mu_{S_{t}T_{t}})\cdot\mathbbm{1}_{\text{hi}}\mid S_{t},T_{t}\right]. Then

𝔼[DG(μSt+1Tt+1μStTt)𝟙hiSt,Tt]αϵ8M+2ϵ.\mathbb{E}\left[D_{G}(\mu_{S_{t+1}T_{t+1}}\parallel\mu_{S_{t}T_{t}})\cdot\mathbbm{1}_{\text{hi}}\mid S_{t},T_{t}\right]\geq\frac{\alpha\epsilon}{8M+2\epsilon}.

The analogous statement is true if Alice says “low,” and likewise if it is instead Bob’s turn.

We assume Lemma C.2 and return to prove it afterward. This lemma translates Equation 9 into a statement about how much Charlie learns. Specifically, we have that

𝔼[DG(μSt+1Tt+1μStTt)]=𝔼[𝔼[DG(μSt+1Tt+1μStTt)St,Tt]]\displaystyle\mathbb{E}\left[D_{G}(\mu_{S_{t+1}T_{t+1}}\parallel\mu_{S_{t}T_{t}})\right]=\mathbb{E}\left[\mathbb{E}\left[D_{G}(\mu_{S_{t+1}T_{t+1}}\parallel\mu_{S_{t}T_{t}})\mid S_{t},T_{t}\right]\right]
𝔼[𝔼[DG(μSt+1Tt+1μStTt)𝟙hiSt,Tt]]+𝔼[𝔼[DG(μSt+1Tt+1μStTt)𝟙loSt,Tt]]\displaystyle\geq\mathbb{E}\left[\mathbb{E}\left[D_{G}(\mu_{S_{t+1}T_{t+1}}\parallel\mu_{S_{t}T_{t}})\cdot\mathbbm{1}_{\text{hi}}\mid S_{t},T_{t}\right]\right]+\mathbb{E}\left[\mathbb{E}\left[D_{G}(\mu_{S_{t+1}T_{t+1}}\parallel\mu_{S_{t}T_{t}})\cdot\mathbbm{1}_{\text{lo}}\mid S_{t},T_{t}\right]\right]
ϵ8M+2ϵ(𝔼[𝔼[DG(μσTtμStTt)𝟙hiSt,Tt]]+𝔼[𝔼[DG(μσTtμStTt)𝟙loSt,Tt]])\displaystyle\geq\frac{\epsilon}{8M+2\epsilon}(\mathbb{E}\left[\mathbb{E}\left[D_{G}(\mu_{\sigma T_{t}}\parallel\mu_{S_{t}T_{t}})\cdot\mathbbm{1}_{\text{hi}}\mid S_{t},T_{t}\right]\right]+\mathbb{E}\left[\mathbb{E}\left[D_{G}(\mu_{\sigma T_{t}}\parallel\mu_{S_{t}T_{t}})\cdot\mathbbm{1}_{\text{lo}}\mid S_{t},T_{t}\right]\right])
ϵ26(8M+2ϵ).\displaystyle\geq\frac{\epsilon^{2}}{6(8M+2\epsilon)}.

Case 2:

𝔼[DG(μStτμStTt)]4ϵ3\mathbb{E}\left[D_{G}(\mu_{S_{t}\tau}\parallel\mu_{S_{t}T_{t}})\right]\geq\frac{4\epsilon}{3}. Using the Pythagorean theorem to write the same Bregman divergence in two ways, we have that

𝔼[DG(μSt+1τμSt+1Tt+1)]+𝔼[DG(μSt+1Tt+1μStTt)]=𝔼[DG(μSt+1τμStTt)]\displaystyle\mathbb{E}\left[D_{G}(\mu_{S_{t+1}\tau}\parallel\mu_{S_{t+1}T_{t+1}})\right]+\mathbb{E}\left[D_{G}(\mu_{S_{t+1}T_{t+1}}\parallel\mu_{S_{t}T_{t}})\right]=\mathbb{E}\left[D_{G}(\mu_{S_{t+1}\tau}\parallel\mu_{S_{t}T_{t}})\right]
=𝔼[DG(μSt+1τμStτ)]+𝔼[DG(μStτμStTt)]𝔼[DG(μStτμStTt)]4ϵ3.\displaystyle=\mathbb{E}\left[D_{G}(\mu_{S_{t+1}\tau}\parallel\mu_{S_{t}\tau})\right]+\mathbb{E}\left[D_{G}(\mu_{S_{t}\tau}\parallel\mu_{S_{t}T_{t}})\right]\geq\mathbb{E}\left[D_{G}(\mu_{S_{t}\tau}\parallel\mu_{S_{t}T_{t}})\right]\geq\frac{4\epsilon}{3}.

This means that one of the two summands on the left-hand side is at least 2ϵ3\frac{2\epsilon}{3}.

Case 2a: 𝔼[DG(μSt+1τμSt+1Tt+1)]2ϵ3\mathbb{E}\left[D_{G}(\mu_{S_{t+1}\tau}\parallel\mu_{S_{t+1}T_{t+1}})\right]\geq\frac{2\epsilon}{3}. In that case we have that

𝔼[DG(μSt+2Tt+2μSt+1Tt+1)]ϵ26(8M+2ϵ)\mathbb{E}\left[D_{G}(\mu_{S_{t+2}T_{t+2}}\parallel\mu_{S_{t+1}T_{t+1}})\right]\geq\frac{\epsilon^{2}}{6(8M+2\epsilon)}

by the same logic as in Case 1.

Case 2b: 𝔼[DG(μSt+1Tt+1μStTt)]2ϵ3ϵ212ϵϵ26(8M+2ϵ)\mathbb{E}\left[D_{G}(\mu_{S_{t+1}T_{t+1}}\parallel\mu_{S_{t}T_{t}})\right]\geq\frac{2\epsilon}{3}\geq\frac{\epsilon^{2}}{12\epsilon}\geq\frac{\epsilon^{2}}{6(8M+2\epsilon)}.

In each of our cases, we have that

𝔼[DG(YμStTt)DG(YμSt+2Tt+2)]=𝔼[DG(μSt+2Tt+2μStTt)]\displaystyle\mathbb{E}\left[D_{G}(Y\parallel\mu_{S_{t}T_{t}})-D_{G}(Y\parallel\mu_{S_{t+2}T_{t+2}})\right]=\mathbb{E}\left[D_{G}(\mu_{S_{t+2}T_{t+2}}\parallel\mu_{S_{t}T_{t}})\right]
=𝔼[DG(μSt+2Tt+2μSt+1Tt+1)]+𝔼[DG(μSt+1Tt+1μStTt)]ϵ26(8M+2ϵ).\displaystyle=\mathbb{E}\left[D_{G}(\mu_{S_{t+2}T_{t+2}}\parallel\mu_{S_{t+1}T_{t+1}})\right]+\mathbb{E}\left[D_{G}(\mu_{S_{t+1}T_{t+1}}\parallel\mu_{S_{t}T_{t}})\right]\geq\frac{\epsilon^{2}}{6(8M+2\epsilon)}.

Therefore, the total number of steps until agreement is first reached cannot be more than

2Mϵ26(8M+2ϵ)=24M(4M+ϵ)ϵ2.2\cdot\frac{M}{\frac{\epsilon^{2}}{6(8M+2\epsilon)}}=\frac{24M(4M+\epsilon)}{\epsilon^{2}}.

This completes the proof. ∎

We now prove Lemma C.2.

Proof of Lemma C.2.

We will restrict our probability space to outcomes where Charlie knows St,TtS_{t},T_{t} at time tt (and thus omit “St,Tt\mid S_{t},T_{t}” from here on). For convenience, we will let A:=μσTtA:=\mu_{\sigma T_{t}} be Alice’s expectation (a random variable) and c:=μStTtc:=\mu_{S_{t}T_{t}} be Charlie’s expectation (which is a particular number in [0,1][0,1]). We will let ϵ:=ϵ2\epsilon^{\prime}:=\frac{\epsilon}{2}, so that if Alice says “high” then Charlie knows that A>cA>c and that DG(Ac)ϵD_{G}(A\parallel c)\geq\epsilon^{\prime}.

Let D(x):=DG(xc)=G(x)G(c)G(c)(xc)D(x):=D_{G}(x\parallel c)=G(x)-G(c)-G^{\prime}(c)(x-c), and let a^h:=𝔼[Ahi]\hat{a}_{h}:=\mathbb{E}\left[A\mid\text{hi}\right]. Note that if Alice says “high” then μSt+1Tt+1=a^h\mu_{S_{t+1}T_{t+1}}=\hat{a}_{h}. In our new notation, we may write α=𝔼[D(A)hi][hi]\alpha=\mathbb{E}\left[D(A)\mid\text{hi}\right]\cdot\mathbb{P}\left[\text{hi}\right], and we wish to show that D(a^h)[hi]αϵ2(M+ϵ)D(\hat{a}_{h})\cdot\mathbb{P}\left[\text{hi}\right]\geq\frac{\alpha\epsilon^{\prime}}{2(M+\epsilon^{\prime})}. Put otherwise, our goal is to show that

D(a^h)𝔼[D(A)hi]ϵ2(M+ϵ).\frac{D(\hat{a}_{h})}{\mathbb{E}\left[D(A)\mid\text{hi}\right]}\geq\frac{\epsilon^{\prime}}{2(M+\epsilon^{\prime})}.

For convenience we will let BB denote the quantity on the left-hand side.

Let ahmina_{\text{hmin}} be the number larger than cc such that D(a)=ϵD(a)=\epsilon^{\prime}, so that AahminA\geq a_{\text{hmin}} whenever Alice says “high.”131313If D(a)<ϵD(a)<\epsilon^{\prime} for all a>ca>c then Alice never says “high” and the lemma statement is trivial. Observe that since DD is convex (Bregman divergences are convex in their first argument), for a fixed value of a^h\hat{a}_{h}, the value of 𝔼[D(A)hi]\mathbb{E}\left[D(A)\mid\text{hi}\right] is maximized when AA is either ahmina_{\text{hmin}} or 11 (with probabilities 1a^h1ahmin\frac{1-\hat{a}_{h}}{1-a_{\text{hmin}}} and a^hahmin1ahmin\frac{\hat{a}_{h}-a_{\text{hmin}}}{1-a_{\text{hmin}}}, respectively). Therefore we have

B=D(a^h)𝔼[D(A)hi]D(a^h)(1ahmin)(1a^h)ϵ+(a^hahmin)D(1).B=\frac{D(\hat{a}_{h})}{\mathbb{E}\left[D(A)\mid\text{hi}\right]}\geq\frac{D(\hat{a}_{h})(1-a_{\text{hmin}})}{(1-\hat{a}_{h})\epsilon^{\prime}+(\hat{a}_{h}-a_{\text{hmin}})D(1)}. (10)

Case 1:

(1a^h)ϵ(a^hahmin)D(1)(1-\hat{a}_{h})\epsilon^{\prime}\geq(\hat{a}_{h}-a_{\text{hmin}})D(1). In that case we have

BD(a^h)(1ahmin)2(1a^h)ϵϵ(1ahmin)2(1a^h)ϵ12ϵ2(M+ϵ).B\geq\frac{D(\hat{a}_{h})(1-a_{\text{hmin}})}{2(1-\hat{a}_{h})\epsilon^{\prime}}\geq\frac{\epsilon^{\prime}(1-a_{\text{hmin}})}{2(1-\hat{a}_{h})\epsilon^{\prime}}\geq\frac{1}{2}\geq\frac{\epsilon^{\prime}}{2(M+\epsilon^{\prime})}.

Case 2:

(1a^h)ϵ(a^hahmin)D(1)(1-\hat{a}_{h})\epsilon^{\prime}\leq(\hat{a}_{h}-a_{\text{hmin}})D(1). In that case we have

BD(a^h)(1ahmin)2(a^hahmin)D(1).B\geq\frac{D(\hat{a}_{h})(1-a_{\text{hmin}})}{2(\hat{a}_{h}-a_{\text{hmin}})D(1)}. (11)

Case 2a: D(1)1ca^hc(M+ϵ)D(1)\leq\frac{1-c}{\hat{a}_{h}-c}(M+\epsilon^{\prime}). Then we have

BD(a^h)(1ahmin)2(a^hahmin)1ca^hc(M+ϵ)ϵ2(M+ϵ)(1ahmin)(a^hc)(a^hahmin)(1c).B\geq\frac{D(\hat{a}_{h})(1-a_{\text{hmin}})}{2(\hat{a}_{h}-a_{\text{hmin}})\cdot\frac{1-c}{\hat{a}_{h}-c}(M+\epsilon^{\prime})}\geq\frac{\epsilon^{\prime}}{2(M+\epsilon^{\prime})}\cdot\frac{(1-a_{\text{hmin}})(\hat{a}_{h}-c)}{(\hat{a}_{h}-a_{\text{hmin}})(1-c)}.

(In the last step we again use that D(a^h)ϵD(\hat{a}_{h})\geq\epsilon^{\prime}.) Now, it is easy to verify that the second fraction is at least 11 (this comes down to the fact that ahminca_{\text{hmin}}\geq c), so we indeed have that Bϵ2(M+ϵ)B\geq\frac{\epsilon^{\prime}}{2(M+\epsilon^{\prime})}.

Case 2b: D(1)1ca^hc(M+ϵ)D(1)\geq\frac{1-c}{\hat{a}_{h}-c}(M+\epsilon^{\prime}). We claim that for all xcx\geq c, we have that

D(x)xc1cD(1)M.D(x)\geq\frac{x-c}{1-c}D(1)-M. (12)

To see this, suppose for contradiction that for some xx we have D(x)<xc1cD(1)MD(x)<\frac{x-c}{1-c}D(1)-M. Then

G(x)G(c)G(c)(xc)\displaystyle G(x)-G(c)-G^{\prime}(c)(x-c) <xc1c(G(1)G(c)G(c)(1c))M\displaystyle<\frac{x-c}{1-c}(G(1)-G(c)-G^{\prime}(c)(1-c))-M
(1c)G(x)(1c)G(c)\displaystyle(1-c)G(x)-(1-c)G(c) <(xc)G(1)(xc)G(c)(1c)M\displaystyle<(x-c)G(1)-(x-c)G(c)-(1-c)M
G(x)+M\displaystyle G(x)+M <(1x)G(c)+(xc)G(1)1c.\displaystyle<\frac{(1-x)G(c)+(x-c)G(1)}{1-c}.

On the other hand, we have that both G(c)G(c) and G(1)G(1) are less than or equal to G(x)+MG(x)+M, by definition of MM. This means that

G(1),G(c)<(1x)G(c)+(xc)G(1)1cG(1),G(c)<\frac{(1-x)G(c)+(x-c)G(1)}{1-c}

but this implies that G(1)<G(c)G(1)<G(c) and that G(c)<G(1)G(c)<G(1), a contradiction.

Plugging in x=a^hx=\hat{a}_{h} into Equation 12, we find that

D(a^h)a^hc1cD(1)M.D(\hat{a}_{h})\geq\frac{\hat{a}_{h}-c}{1-c}D(1)-M.

Plugging this bound into Equation 11, we get that

B\displaystyle B (a^hc1cD(1)M)(1ahmin)2(a^hahmin)D(1)=1ahmin2(a^hahmin)a^hc1c(1Ma^hc1cD(1))\displaystyle\geq\frac{\left(\frac{\hat{a}_{h}-c}{1-c}D(1)-M\right)(1-a_{\text{hmin}})}{2(\hat{a}_{h}-a_{\text{hmin}})D(1)}=\frac{1-a_{\text{hmin}}}{2(\hat{a}_{h}-a_{\text{hmin}})}\cdot\frac{\hat{a}_{h}-c}{1-c}\left(1-\frac{M}{\frac{\hat{a}_{h}-c}{1-c}D(1)}\right)
1ahmin2(a^hahmin)a^hc1c(1MM+ϵ)ϵ2(M+ϵ),\displaystyle\geq\frac{1-a_{\text{hmin}}}{2(\hat{a}_{h}-a_{\text{hmin}})}\cdot\frac{\hat{a}_{h}-c}{1-c}\left(1-\frac{M}{M+\epsilon^{\prime}}\right)\geq\frac{\epsilon^{\prime}}{2(M+\epsilon^{\prime})},

where in the second-to-last step we use that D(1)1ca^hc(M+ϵ)D(1)\geq\frac{1-c}{\hat{a}_{h}-c}(M+\epsilon^{\prime}) and in the last step we again use the fact that (1ahmin)(a^hc)(a^hahmin)(1c)1\frac{(1-a_{\text{hmin}})(\hat{a}_{h}-c)}{(\hat{a}_{h}-a_{\text{hmin}})(1-c)}\geq 1. ∎

Proof of Lemma 4.19.


We will partition [0,1][0,1] into a number NN of small intervals I1=[x0=0,x1)I_{1}=[x_{0}=0,x_{1}), I2=[x1,x2)I_{2}=[x_{1},x_{2}), I3=[x2,x3)I_{3}=[x_{2},x_{3}), …, IN=[xN1,xN=1]I_{N}=[x_{N-1},x_{N}=1] with certain desirable properties (which we will describe below). For k[N]k\in[N], we will let S(k):={σ𝒮:μσIk}S^{(k)}:=\{\sigma\in\mathcal{S}:\mu_{\sigma}\in I_{k}\}. For a given σ𝒮\sigma\in\mathcal{S}, we will let k(σ)k(\sigma) be the kk such that σS(k)\sigma\in S^{(k)}.

Our goal is to upper bound the expectation of DG(μστμτ)D_{G}(\mu_{\sigma\tau}\parallel\mu_{\tau}). In pursuit of this goal, we observe that by Proposition 4.8 we have

𝔼[DG(μστμτ)]=𝔼[DG(μστμS(k(σ))τ)]+𝔼[DG(μS(k(σ))τμτ)].\mathbb{E}\left[D_{G}(\mu_{\sigma\tau}\parallel\mu_{\tau})\right]=\mathbb{E}\left[D_{G}(\mu_{\sigma\tau}\parallel\mu_{S^{(k(\sigma))}\tau})\right]+\mathbb{E}\left[D_{G}(\mu_{S^{(k(\sigma))}\tau}\parallel\mu_{\tau})\right]. (13)

Now, for any kk, by applying Equation 5 to S=S(k)S=S^{(k)} and T=𝒯T=\mathcal{T}, we know that

𝔼[DG(μσμS(k))S(k)]𝔼[DG(μστμS(k)τ)S(k)].\mathbb{E}\left[D_{G}(\mu_{\sigma}\parallel\mu_{S^{(k)}})\mid S^{(k)}\right]\geq\mathbb{E}\left[D_{G}(\mu_{\sigma\tau}\parallel\mu_{S^{(k)}\tau})\mid S^{(k)}\right].

(Here, “S(k)\mid S^{(k)}” is short for “σS(k)\mid\sigma\in S^{(k)}.”) This is our only use of the rectangle substitutes assumption. Now, taking the expectation over kk (i.e. choosing each kk with probability equal to [σS(k)]\mathbb{P}\left[\sigma\in S^{(k)}\right]), we have that

𝔼[DG(μσμS(k(σ)))]𝔼[DG(μστμS(k(σ))τ)].\mathbb{E}\left[D_{G}(\mu_{\sigma}\parallel\mu_{S^{(k(\sigma))}})\right]\geq\mathbb{E}\left[D_{G}(\mu_{\sigma\tau}\parallel\mu_{S^{(k(\sigma))}\tau})\right].

Together with Equation 13, this tells us that

𝔼[DG(μστμτ)]𝔼[DG(μσμS(k(σ)))]+𝔼[DG(μS(k(σ))τμτ)].\mathbb{E}\left[D_{G}(\mu_{\sigma\tau}\parallel\mu_{\tau})\right]\leq\mathbb{E}\left[D_{G}(\mu_{\sigma}\parallel\mu_{S^{(k(\sigma))}})\right]+\mathbb{E}\left[D_{G}(\mu_{S^{(k(\sigma))}\tau}\parallel\mu_{\tau})\right]. (14)

Our goal will be to bound the two summands in Equation 14. We will specify the boundaries of the intervals I1,,INI_{1},\dots,I_{N} with this goal in mind.

On an intuitive level, we are hoping for two things to be true:

  • In order for the first summand to be small, we want μσ\mu_{\sigma} and μS(k(σ))\mu_{S^{(k(\sigma))}} to be similar in value. In other words, we want each interval to be “short” (for a notion of shortness with respect to GG that we are about to discuss).

  • In order for the second summand to be small, we want μS(k(σ))τ\mu_{S^{(k(\sigma))}\tau} and μτ\mu_{\tau} to be similar in value. In other words, the estimate of a third party who knows τ\tau shouldn’t change much upon learning k(σ)k(\sigma). One way to ensure this is by creating the intervals in a way that makes the third party very confident about the value of k(σ)k(\sigma) before learning it. Intuitively this should be true because Alice and Bob approximately agree, so Alice’s estimate is likely to be close to Bob’s. However, we must be careful to strategically choose the boundaries of our intervals x1,,xN1x_{1},\dots,x_{N-1} so that Alice’s and Bob’s estimates are unlikely to be on opposite sides of a boundary.141414This limits how many intervals we can reasonably use, which is why we cannot make our intervals arbitrarily short to satisfy the first of our two criteria.

What, formally, do we need for the first summand to be small? For any kk, we have μS(k(σ))=𝔼[μσσS(k)]\mu_{S^{(k(\sigma))}}=\mathbb{E}\left[\mu_{\sigma}\mid\sigma\in S^{(k)}\right]. We can apply Proposition C.1 (iv) to the random variable X=μσX=\mu_{\sigma} on the probability subspace given by σS(k)\sigma\in S^{(k)}. Since XX takes on values in IkI_{k}, we have that

𝔼[DG(μσμS(k))S(k)]2JBG(Ik),\mathbb{E}\left[D_{G}(\mu_{\sigma}\parallel\mu_{S^{(k)}})\mid S^{(k)}\right]\leq 2\text{JB}_{G}(I_{k}), (15)

where JBG(Ik)\text{JB}_{G}(I_{k}) is shorthand for the Jensen-Bregman divergence between the endpoints of IkI_{k}. Therefore, if JBG(Ik)\text{JB}_{G}(I_{k}) is small for all kk, then the first summand (which is an expected value of 𝔼[DG(μσμS(k))S(k)]\mathbb{E}\left[D_{G}(\mu_{\sigma}\parallel\mu_{S^{(k)}})\mid S^{(k)}\right] over k[N]k\in[N]) is also small.

What about the second summand? As per the intuition above, we wish to choose our boundary points x1,,xN1x_{1},\dots,x_{N-1} so that Alice’s and Bob’s estimates are unlikely to be on opposite sides of any boundary. Let μ=min(μσ,μτ)\mu_{-}=\min(\mu_{\sigma},\mu_{\tau}) be the smaller of the two estimates and μ+=max(μσ,μτ)\mu_{+}=\max(\mu_{\sigma},\mu_{\tau}) be the larger one. We say that μ,μ+\mu_{-},\mu_{+} thwart a point x(0,1)x\in(0,1) if μxμ+\mu_{-}\leq x\leq\mu_{+} and μμ+\mu_{-}\neq\mu_{+}. We define the thwart density of xx to be

ρ(x):=[μ,μ+ thwart x].\rho(x):=\mathbb{P}\left[\mu_{-},\mu_{+}\text{ thwart }x\right].

Roughly speaking, we will choose x1,,xN1x_{1},\dots,x_{N-1} such that ρ(xk)\rho(x_{k}) is small on average.

We will approach this problem by first creating intervals to satisfy the first criterion (short intervals), without regard to the second, and then modifying them to satisfy the second without compromising the first. Formally, we choose our intervals according to the following algorithm.

Algorithm C.3 (Partitioning [0,1][0,1] into intervals I1,,INI_{1},\dots,I_{N}).

  1. (1)

    Choose points 0<x1<x2<<xN2<10<x_{1}^{\prime}<x_{2}^{\prime}<\dots<x_{N-2}^{\prime}<1 such that the N1N-1 intervals thus created all have Jensen-Bregman divergence between β\beta and 2βc\frac{2\beta}{c}, inclusive, where β\beta and cc are as in the statement of Lemma 4.19. (NN is not pre-determined; it is defined as one more than the number of intervals created.) (See footnote for why this is possible.151515Define x1x_{1}^{\prime} so that JBG(0,x1)=2βc\text{JB}_{G}(0,x_{1}^{\prime})=\frac{2\beta}{c} (this is possible because JBG\text{JB}_{G} is continuous in its arguments). Define x2x_{2}^{\prime} so that JBG(x1,x2)=2βc\text{JB}_{G}(x_{1}^{\prime},x_{2}^{\prime})=\frac{2\beta}{c}. Keep going until an endpoint xN3x_{N-3}^{\prime} is defined such that adding xN2x_{N-2}^{\prime} as before would leave an interval (xN2,1)(x_{N-2}^{\prime},1) with Jensen-Bregman divergence less than 2βc\frac{2\beta}{c}. Now, instead of defining xN2x_{N-2}^{\prime} in this way, define it so that JBG(xN3,xN2)=JBG(xN2,1)\text{JB}_{G}(x_{N-3}^{\prime},x_{N-2}^{\prime})=\text{JB}_{G}(x_{N-2}^{\prime},1). Since JBG(xN3,1)2βc\text{JB}_{G}(x_{N-3}^{\prime},1)\geq\frac{2\beta}{c}, the cc-approximate triangle inequality that we have by assumption tells us that JBG(xN3,xN2)=JBG(xN2,1)β\text{JB}_{G}(x_{N-3}^{\prime},x_{N-2}^{\prime})=\text{JB}_{G}(x_{N-2}^{\prime},1)\geq\beta.)

  2. (2)

    Let x0:=0,xN1:=1x_{0}^{\prime}:=0,x_{N-1}^{\prime}:=1 for convenience. Define Ik:=[xk1,xk]I_{k}^{\prime}:=[x_{k-1}^{\prime},x_{k}^{\prime}]. For k[N1]k\in[N-1], let αk:=infxIkρ(x)\alpha_{k}:=\inf_{x\in I_{k}^{\prime}}\rho(x). Let xkIkx_{k}\in I_{k}^{\prime} be such161616If the infimum is achieved (e.g. if the space of signals to Alice and Bob is finite), then we can set xk:=argminxρ(x)x_{k}:=\arg\min_{x}\rho(x). Our algorithm works in more generality, at the expense of a factor of 22 in our final bound. Note that replacing 22 with a smaller constant can arbitrarily reduce this factor. that ρ(xk)2αk\rho(x_{k})\leq 2\alpha_{k}.

  3. (3)

    Return the intervals I1=[0,x1),I2=[x1,x2),,IN=[xN1,1]I_{1}=[0,x_{1}),I_{2}=[x_{1},x_{2}),\dots,I_{N}=[x_{N-1},1].
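
The sketch below is one way Algorithm C.3 could be implemented for a finite information structure, assumed to be given as a list of (probability, μσ, μτ) triples (a hypothetical input format); G is taken to be the negative Shannon entropy for concreteness, the bisections rely on JB_G being monotone in each endpoint, the infimum in step (2) is approximated by a grid search, and β is assumed small enough that JB_G(0,1) ≥ 2β/c.

```python
import numpy as np

def G(x):
    x = np.clip(x, 1e-12, 1 - 1e-12)   # example G: negative Shannon entropy
    return x * np.log(x) + (1 - x) * np.log(1 - x)

def JB(a, b):                          # Jensen-Bregman divergence of G
    return (G(a) + G(b)) / 2 - G((a + b) / 2)

def solve(f, lo, hi, steps=80):
    """Bisection for the root of an increasing function f with f(lo) <= 0 <= f(hi)."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return hi

def algorithm_c3(pairs, beta, c):
    """pairs: list of (prob, mu_sigma, mu_tau) triples. Returns the cut points x_1..x_{N-1}."""
    # Step (1): provisional cuts whose intervals have JB-length exactly 2*beta/c; once the
    # remainder is too short, re-place the last cut so the final two intervals have equal
    # JB-length (each then has JB-length at least beta, by the approximate triangle inequality).
    target = 2 * beta / c
    prov = [0.0]
    while JB(prov[-1], 1.0) >= target:
        nxt = solve(lambda z, left=prov[-1]: JB(left, z) - target, prov[-1], 1.0)
        if JB(nxt, 1.0) < target:
            break
        prov.append(nxt)
    last = solve(lambda z, left=prov[-1]: JB(left, z) - JB(z, 1.0), prov[-1], 1.0)
    prov += [last, 1.0]

    # Step (2): within each provisional interval, move the cut to (approximately) minimize
    # the thwart density rho(x) = P[mu_- <= x <= mu_+ and mu_- != mu_+].
    def rho(x):
        return sum(p for p, a, b in pairs if min(a, b) <= x <= max(a, b) and a != b)

    cuts = [min(np.linspace(lo, hi, 201)[1:-1], key=rho)
            for lo, hi in zip(prov[:-1], prov[1:])]

    # Step (3): the cuts, together with 0 and 1, define the intervals I_1, ..., I_N.
    return sorted(cuts)
```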

We begin by observing that for any k[N]k\in[N], we have

JBG(Ik)=JBG(xk1,xk)JBG(xk2,xk)1c(JBG(xk2,xk1)+JBG(xk1,xk))4βc2\text{JB}_{G}(I_{k})=\text{JB}_{G}(x_{k-1},x_{k})\leq\text{JB}_{G}(x_{k-2}^{\prime},x_{k}^{\prime})\leq\frac{1}{c}(\text{JB}_{G}(x_{k-2}^{\prime},x_{k-1}^{\prime})+\text{JB}_{G}(x_{k-1}^{\prime},x_{k}^{\prime}))\leq\frac{4\beta}{c^{2}}

where for convenience we define x1:=0,xN:=1x_{-1}^{\prime}:=0,x_{N}^{\prime}:=1. Therefore, by Equation 15, we have

𝔼[DG(μσμS(k(σ)))]8βc2.\mathbb{E}\left[D_{G}(\mu_{\sigma}\parallel\mu_{S^{(k(\sigma))}})\right]\leq\frac{8\beta}{c^{2}}. (16)

It remains to bound the second summand of Equation 14, 𝔼[DG(μS(k(σ))τμτ)]\mathbb{E}\left[D_{G}(\mu_{S^{(k(\sigma))}\tau}\parallel\mu_{\tau})\right], which is the bulk of the proof. We proceed in two steps:

  1. (1)

    (Lemma C.4) We show that k=1Nαk\sum_{k=1}^{N}\alpha_{k} is small. This means that Alice’s and Bob’s estimates are unlikely to lie on opposite sides of some boundary point xkx_{k}. As a consequence, Bob is highly likely to know k(σ)k(\sigma) with a lot of confidence.

  2. (2)

    (Lemma C.7) We bound the second summand as a function of k=1Nαk\sum_{k=1}^{N}\alpha_{k}. The intuition is that if kαk\sum_{k}\alpha_{k} is small, then Bob is highly likely to know k(σ)k(\sigma) with a lot of confidence, which means that he does not learn too much from learning k(σ)k(\sigma).

We begin with the first step; recall our notation μ:=min(μσ,μτ)\mu^{-}:=\min(\mu_{\sigma},\mu_{\tau}) and μ+:=max(μσ,μτ)\mu^{+}:=\max(\mu_{\sigma},\mu_{\tau}).

Lemma C.4.
2k=1Nαk4(ϵβc)1/(1log2c).2\sum_{k=1}^{N}\alpha_{k}\leq 4\left(\frac{\epsilon}{\beta c}\right)^{1/(1-\log_{2}c)}.
Proof.

We use the following claim, whose proof we provide afterward.

Claim C.5.

Let I=[x,x+]I=[x^{-},x^{+}] be any sub-interval of [0,1][0,1] and let α=infxIρ(x)\alpha=\inf_{x\in I}\rho(x). Then there is an increasing sequence of points z0:=x,z1,z2,,zL1,zL:=x+z_{0}:=x^{-},z_{1},z_{2},\dots,z_{L-1},z_{L}:=x^{+}, such that for every [L]\ell\in[L], [μz1,μ+z]α2\mathbb{P}\left[\mu_{-}\leq z_{\ell-1},\mu_{+}\geq z_{\ell}\right]\geq\frac{\alpha}{2}, and where

L2α[L][μz1<μ+z].L\leq\frac{2}{\alpha}\sum_{\ell\in[L]}\mathbb{P}\left[\mu_{-}\leq z_{\ell-1}<\mu_{+}\leq z_{\ell}\right].

We apply Claim C.5 to the intervals I1,,IN1I_{1}^{\prime},\dots,I_{N-1}^{\prime}, with α=αk\alpha=\alpha_{k}. Let zk,0,,zk,Lkz_{k,0},\dots,z_{k,L_{k}} be the points whose existence the claim proves, and let rk:=[Lk][μzk,1<μ+zk,]r_{k}:=\sum_{\ell\in[L_{k}]}\mathbb{P}\left[\mu_{-}\leq z_{k,\ell-1}<\mu_{+}\leq z_{k,\ell}\right], so that Lk2αkrkL_{k}\leq\frac{2}{\alpha_{k}}r_{k}. Observe that krk1\sum_{k}r_{k}\leq 1, because the intervals (zk,1,zk,](z_{k,\ell-1},z_{k,\ell}] are disjoint for all k,k,\ell. We make the following claim (we provide the proof afterward).

Claim C.6.
k[N1]rk(αk2rk)1log2cϵβc.\sum_{k\in[N-1]}r_{k}\left(\frac{\alpha_{k}}{2r_{k}}\right)^{1-\log_{2}c}\leq\frac{\epsilon}{\beta c}. (17)

We may rewrite Equation 17 as

(k[N1]rk(αk2rk)1log2c)1/(1log2c)(ϵβc)1/(1log2c).\left(\sum_{k\in[N-1]}r_{k}\left(\frac{\alpha_{k}}{2r_{k}}\right)^{1-\log_{2}c}\right)^{1/(1-\log_{2}c)}\leq\left(\frac{\epsilon}{\beta c}\right)^{1/(1-\log_{2}c)}.

Recall that krk1\sum_{k}r_{k}\leq 1. Scaling the rkr_{k}’s to add to 11 decreases the left-hand side above, so we may assume that krk=1\sum_{k}r_{k}=1. Note that x1log2cx^{1-\log_{2}c} is convex. Thus, by using a weighted Jensen inequality on the left-hand side with weights rkr_{k}, we find that

12kαk=krkαk2rk(k[N1]rk(αk2rk)1log2c)1/(1log2c)(ϵβc)1/(1log2c).\frac{1}{2}\sum_{k}\alpha_{k}=\sum_{k}r_{k}\cdot\frac{\alpha_{k}}{2r_{k}}\leq\left(\sum_{k\in[N-1]}r_{k}\left(\frac{\alpha_{k}}{2r_{k}}\right)^{1-\log_{2}c}\right)^{1/(1-\log_{2}c)}\leq\left(\frac{\epsilon}{\beta c}\right)^{1/(1-\log_{2}c)}.

This completes the proof of Lemma C.4. ∎

Proof of Claim C.5.

Let z1=inf{z:[μz0<μ+z]α2}z_{1}=\inf\{z:\mathbb{P}\left[\mu_{-}\leq z_{0}<\mu_{+}\leq z\right]\geq\frac{\alpha}{2}\}, or x+x^{+} if this number does not exist or is larger than x+x^{+}. Note that [μz0<μ+]α\mathbb{P}\left[\mu_{-}\leq z_{0}<\mu_{+}\right]\geq\alpha, as we have ρ(z0)=[μz0<μ+]+[μ<z0=μ+]α\rho(z_{0})=\mathbb{P}\left[\mu_{-}\leq z_{0}<\mu_{+}\right]+\mathbb{P}\left[\mu_{-}<z_{0}=\mu_{+}\right]\geq\alpha, so if the first term were less than α\alpha we would have some z>z0z^{\prime}>z_{0} with ρ(z)<α\rho(z^{\prime})<\alpha. On the other hand, [μz0<μ+<z1]α2\mathbb{P}\left[\mu_{-}\leq z_{0}<\mu_{+}<z_{1}\right]\leq\frac{\alpha}{2}, since

[μz0<μ+<z1]=limzz1 from below[μz0<μ+z]\mathbb{P}\left[\mu_{-}\leq z_{0}<\mu_{+}<z_{1}\right]=\lim_{z\to z_{1}\text{ from below}}\mathbb{P}\left[\mu_{-}\leq z_{0}<\mu_{+}\leq z\right]

and if the right-hand side were more than α2\frac{\alpha}{2} then that would contradict the definition of z1z_{1} as an infimum. Therefore, [μz0,μ+z1]α2\mathbb{P}\left[\mu_{-}\leq z_{0},\mu_{+}\geq z_{1}\right]\geq\frac{\alpha}{2}.

If z1=x+z_{1}=x^{+}, we are done. Otherwise, let z2=inf{z:[μz1<μ+z]α2}z_{2}=\inf\{z:\mathbb{P}\left[\mu_{-}\leq z_{1}<\mu_{+}\leq z\right]\geq\frac{\alpha}{2}\}. Then [μz1,μ+z2]α2\mathbb{P}\left[\mu_{-}\leq z_{1},\mu_{+}\geq z_{2}\right]\geq\frac{\alpha}{2}. Define z3z_{3} analogously, and so forth.

All that remains to show is the upper bound on LL. This is where we use the fact that (by construction) [μz1<μ+z]α2\mathbb{P}\left[\mu_{-}\leq z_{\ell-1}<\mu_{+}\leq z_{\ell}\right]\geq\frac{\alpha}{2}. Summing over all \ell, we have

[L][μz1<μ+z]α2L,\sum_{\ell\in[L]}\mathbb{P}\left[\mu_{-}\leq z_{\ell-1}<\mu_{+}\leq z_{\ell}\right]\geq\frac{\alpha}{2}L,

which (after rearranging) completes the proof. ∎

Proof of Claim C.6.

First note that by construction, JBG(Ik)β\text{JB}_{G}(I_{k}^{\prime})\geq\beta for all kk. By repeated use of the cc-approximate triangle inequality,171717We sub-divide IkI_{k}^{\prime} into [zk,0,zk,Lk/2][z_{k,0},z_{k,L_{k}/2}] and [zk,Lk/2,zk,L][z_{k,L_{k}/2},z_{k,L}], then subdivide each of these, and so on. we find that

[Lk]JBG(zk,1,zk,)clog2LkJBG(Ik)c1+log22rkαkJBG(Ik)c1+log22rkαkβ=c(2rkαk)log2cβ.\sum_{\ell\in[L_{k}]}\text{JB}_{G}(z_{k,\ell-1},z_{k,\ell})\geq c^{\left\lceil\log_{2}L_{k}\right\rceil}\text{JB}_{G}(I_{k}^{\prime})\geq c^{1+\log_{2}\frac{2r_{k}}{\alpha_{k}}}\text{JB}_{G}(I_{k}^{\prime})\geq c^{1+\log_{2}\frac{2r_{k}}{\alpha_{k}}}\beta=c\left(\frac{2r_{k}}{\alpha_{k}}\right)^{\log_{2}c}\beta.

On the other hand, we have

ϵ\displaystyle\epsilon 𝔼[JBG(μσ,μτ)]=σ,τ[σ,τ]JBG(μσ,μτ)σ,τ[σ,τ]k,:μzk,1μ+zk,JBG(zk,1,zk,)\displaystyle\geq\mathbb{E}\left[\text{JB}_{G}(\mu_{\sigma},\mu_{\tau})\right]=\sum_{\sigma,\tau}\mathbb{P}\left[\sigma,\tau\right]\text{JB}_{G}(\mu_{\sigma},\mu_{\tau})\geq\sum_{\sigma,\tau}\mathbb{P}\left[\sigma,\tau\right]\sum_{\begin{subarray}{c}k,\ell:\mu_{-}\leq z_{k,\ell-1}\\ \mu_{+}\geq z_{k,\ell}\end{subarray}}\text{JB}_{G}(z_{k,\ell-1},z_{k,\ell})
=k,[μzk,1,μ+zk,]JBG(zk,1,zk,)k,αk2JBG(zk,1,zk,).\displaystyle=\sum_{k,\ell}\mathbb{P}\left[\mu_{-}\leq z_{k,\ell-1},\mu_{+}\geq z_{k,\ell}\right]\text{JB}_{G}(z_{k,\ell-1},z_{k,\ell})\geq\sum_{k,\ell}\frac{\alpha_{k}}{2}\text{JB}_{G}(z_{k,\ell-1},z_{k,\ell}).

Here, the third step follows by the reverse triangle inequality (Fact (ii) of Proposition C.1) and the fourth step follows by rearranging the order of summation.181818The case that the space of signals is infinite is identical except that the summation is replaced by an integral over the probability space. Combining the last two facts gives us that

ϵkαk2c(2rkαk)log2cβ=krk(2rkαk)log2c1βc,\epsilon\geq\sum_{k}\frac{\alpha_{k}}{2}\cdot c\left(\frac{2r_{k}}{\alpha_{k}}\right)^{\log_{2}c}\beta=\sum_{k}r_{k}\left(\frac{2r_{k}}{\alpha_{k}}\right)^{\log_{2}c-1}\beta c,

which rearranges to the desired inequality. ∎

We are now ready to bound the second summand, i.e. 𝔼[DG(μS(k(σ))τμτ)]\mathbb{E}\left[D_{G}(\mu_{S^{(k(\sigma))}\tau}\parallel\mu_{\tau})\right], where k(σ)k(\sigma) is the kk such that Alice’s estimate μσ\mu_{\sigma} lies in IkI_{k}. For convenience we will define k(τ)k(\tau) for Bob by analogy as the kk such that μτ\mu_{\tau} lies in IkI_{k}. By Lemma C.4 and the preceding discussion, we know that

[k(σ)k(τ)]4(ϵβc)1/(1log2c).\mathbb{P}\left[k(\sigma)\neq k(\tau)\right]\leq 4\left(\frac{\epsilon}{\beta c}\right)^{1/(1-\log_{2}c)}.
Lemma C.7.

Let Q=[k(σ)k(τ)]Q=\mathbb{P}\left[k(\sigma)\neq k(\tau)\right]. Then

𝔼[DG(μS(k(σ))τμτ)]2G~(Q).\mathbb{E}\left[D_{G}(\mu_{S^{(k(\sigma))}\tau}\parallel\mu_{\tau})\right]\leq 2\tilde{G}^{*}(Q).

The key idea is that because k(σ)=k(τ)k(\sigma)=k(\tau) with probability near 11, learning k(σ)k(\sigma) is unlikely to make Bob update his estimate much.

Proof.

Consider any signal τ^𝒯\hat{\tau}\in\mathcal{T} and let p(τ^)=[τ=τ^]p(\hat{\tau})=\mathbb{P}\left[\tau=\hat{\tau}\right]. We have191919This proof takes sums over τ^𝒯\hat{\tau}\in\mathcal{T} and thus implicitly assumes that 𝒯\mathcal{T} is finite, but the proof extends to infinite 𝒯\mathcal{T}, with sums over τ\tau replaced by integrals with respect to the probability measure over 𝒯\mathcal{T}.

𝔼[DG(μS(k(σ))τμτ)]=τ^𝒯p(τ^)𝔼[DG(μS(k(σ))τ^μτ^)τ=τ^].\mathbb{E}\left[D_{G}(\mu_{S^{(k(\sigma))}\tau}\parallel\mu_{\tau})\right]=\sum_{\hat{\tau}\in\mathcal{T}}p(\hat{\tau})\mathbb{E}\left[D_{G}(\mu_{S^{(k(\sigma))}\hat{\tau}}\parallel\mu_{\hat{\tau}})\mid\tau=\hat{\tau}\right].

Note that μτ^=𝔼[μS(k(σ))τ^τ=τ^]\mu_{\hat{\tau}}=\mathbb{E}\left[\mu_{S^{(k(\sigma))}\hat{\tau}}\mid\tau=\hat{\tau}\right], so by the equality in Proposition C.1 (iv) we have that

𝔼[DG(μS(k(σ))τμτ)]=τ^𝒯p(τ^)(𝔼[G(μS(k(σ))τ^)τ=τ^]G(μτ^)).\mathbb{E}\left[D_{G}(\mu_{S^{(k(\sigma))}\tau}\parallel\mu_{\tau})\right]=\sum_{\hat{\tau}\in\mathcal{T}}p(\hat{\tau})\left(\mathbb{E}\left[G(\mu_{S^{(k(\sigma))}\hat{\tau}})\mid\tau=\hat{\tau}\right]-G(\mu_{\hat{\tau}})\right).

Let q(τ^)=[τ=τ^,k(σ)k(τ^)]q(\hat{\tau})=\mathbb{P}\left[\tau=\hat{\tau},k(\sigma)\neq k(\hat{\tau})\right], so τ^𝒯q(τ^)=Q\sum_{\hat{\tau}\in\mathcal{T}}q(\hat{\tau})=Q. Then

𝔼[G(μS(k(σ))τ^)τ=τ^]G(μτ^)=\displaystyle\mathbb{E}\left[G(\mu_{S^{(k(\sigma))}\hat{\tau}})\mid\tau=\hat{\tau}\right]-G(\mu_{\hat{\tau}})=
p(τ^)q(τ^)p(τ^)(𝔼[G(μS(k(τ^))τ^)G(μτ^)])+q(τ^)p(τ^)(𝔼[G(μS(k(σ))τ^)τ=τ^,k(σ)k(τ^)]G(μτ^)).\displaystyle\frac{p(\hat{\tau})-q(\hat{\tau})}{p(\hat{\tau})}\left(\mathbb{E}\left[G(\mu_{S^{(k(\hat{\tau}))}\hat{\tau}})-G(\mu_{\hat{\tau}})\right]\right)+\frac{q(\hat{\tau})}{p(\hat{\tau})}\left(\mathbb{E}\left[G(\mu_{S^{(k(\sigma))}\hat{\tau}})\mid\tau=\hat{\tau},k(\sigma)\neq k(\hat{\tau})\right]-G(\mu_{\hat{\tau}})\right).

The second term is at most q(τ^)p(τ^)M\frac{q(\hat{\tau})}{p(\hat{\tau})}M, since MM is the range of GG. To bound the first term, we note that μS(k(τ^))τ^\mu_{S^{(k(\hat{\tau}))}\hat{\tau}} cannot differ from μτ^\mu_{\hat{\tau}} by more than q(τ^)p(τ^)q(τ^)\frac{q(\hat{\tau})}{p(\hat{\tau})-q(\hat{\tau})}, as otherwise the average value of μS(k(σ))τ^\mu_{S^{(k(\sigma))}\hat{\tau}} could not be μτ^\mu_{\hat{\tau}}. Therefore, 𝔼[G(μS(k(τ^))τ^)G(μτ^)]\mathbb{E}\left[G(\mu_{S^{(k(\hat{\tau}))}\hat{\tau}})-G(\mu_{\hat{\tau}})\right] is bounded by the largest possible difference in GG-values of two points that differ by at most q(τ^)p(τ^)q(τ^)\frac{q(\hat{\tau})}{p(\hat{\tau})-q(\hat{\tau})}. Therefore, we have

𝔼[DG(μS(k(σ))τμτ)]\displaystyle\mathbb{E}\left[D_{G}(\mu_{S^{(k(\sigma))}\tau}\parallel\mu_{\tau})\right] τ^𝒯p(τ^)(p(τ^)q(τ^)p(τ^)G~(q(τ^)p(τ^)q(τ^))+q(τ^)p(τ^)M)\displaystyle\leq\sum_{\hat{\tau}\in\mathcal{T}}p(\hat{\tau})\left(\frac{p(\hat{\tau})-q(\hat{\tau})}{p(\hat{\tau})}\tilde{G}\left(\frac{q(\hat{\tau})}{p(\hat{\tau})-q(\hat{\tau})}\right)+\frac{q(\hat{\tau})}{p(\hat{\tau})}M\right)
QM+τ^𝒯(p(τ^)q(τ^))G~(q(τ^)p(τ^)q(τ^)),\displaystyle\leq QM+\sum_{\hat{\tau}\in\mathcal{T}}(p(\hat{\tau})-q(\hat{\tau}))\tilde{G}\left(\frac{q(\hat{\tau})}{p(\hat{\tau})-q(\hat{\tau})}\right),

where G~\tilde{G} is defined as in the statement of Lemma C.7. If GG is symmetric on [0,1][0,1], then G~(x)=G(0)G(x)\tilde{G}(x)=G(0)-G(x) for x12x\leq\frac{1}{2} and MM otherwise. This is a concave function, but G~\tilde{G} is not in general concave. However, consider G~\tilde{G}^{*} as defined in the lemma statement, so G~(x)G~(x)\tilde{G}(x)\leq\tilde{G}^{*}(x) for all xx. Then

𝔼[DG(μS(k(σ))τμτ)]\displaystyle\mathbb{E}\left[D_{G}(\mu_{S^{(k(\sigma))}\tau}\parallel\mu_{\tau})\right] QM+τ^𝒯(p(τ^)q(τ^))G~(q(τ^)p(τ^)q(τ^))\displaystyle\leq QM+\sum_{\hat{\tau}\in\mathcal{T}}(p(\hat{\tau})-q(\hat{\tau}))\tilde{G}^{*}\left(\frac{q(\hat{\tau})}{p(\hat{\tau})-q(\hat{\tau})}\right)
QM+(τ^𝒯(p(τ^)q(τ^)))G~(τ^𝒯q(τ^)τ^𝒯(p(τ^)q(τ^)))\displaystyle\leq QM+\left(\sum_{\hat{\tau}\in\mathcal{T}}(p(\hat{\tau})-q(\hat{\tau}))\right)\cdot\tilde{G}^{*}\left(\frac{\sum_{\hat{\tau}\in\mathcal{T}}q(\hat{\tau})}{\sum_{\hat{\tau}\in\mathcal{T}}(p(\hat{\tau})-q(\hat{\tau}))}\right)
=QM+(1Q)G~(Q1Q)QM+G~(Q)2G~(Q).\displaystyle=QM+(1-Q)\tilde{G}^{*}\left(\frac{Q}{1-Q}\right)\leq QM+\tilde{G}^{*}(Q)\leq 2\tilde{G}^{*}(Q).

Here, the second step follows by Jensen’s inequality applied to the concave function G~\tilde{G}^{*}, with terms q(τ^)p(τ^)q(τ^)\frac{q(\hat{\tau})}{p(\hat{\tau})-q(\hat{\tau})} and weights p(τ^)q(τ^)p(\hat{\tau})-q(\hat{\tau}); the second-to-last step follows from the fact that G~\tilde{G}^{*} is concave and G~(0)=0\tilde{G}^{*}(0)=0, and the last step follows from the fact that G~\tilde{G}^{*} is concave and G~(1)=M\tilde{G}^{*}(1)=M.

Since Q4(ϵβc)1/(1log2c)Q\leq 4\left(\frac{\epsilon}{\beta c}\right)^{1/(1-\log_{2}c)}, combining Lemma C.7 with Equation 16 gives us the following result.

𝔼[DG(μστμτ)]8βc2+2G~(4(ϵβc)1/(1log2c)).\mathbb{E}\left[D_{G}(\mu_{\sigma\tau}\parallel\mu_{\tau})\right]\leq\frac{8\beta}{c^{2}}+2\tilde{G}^{*}\left(4\left(\frac{\epsilon}{\beta c}\right)^{1/(1-\log_{2}c)}\right).

Noting that G~\tilde{G}^{*} is concave and c1/(1log2c)2c^{-1/(1-\log_{2}c)}\leq 2 (which is true for all 0<c<10<c<1) completes the proof of Lemma 4.19. ∎

Appendix D Alternative Definitions of Agreement and Accuracy

For arbitrary Bregman divergences, there are several notions of agreement and accuracy that are worth considering. Before we discuss these, we make a note about the order of arguments in a Bregman divergence. In our context, it makes the most sense to talk of the Bregman divergence from a more informed estimate to a less informed estimate. By a “more informed estimate” we mean a finer-grained one, i.e. one that is informed by more knowledge. For example, when estimating YY in the setting that this work explores, YY itself is more informed than μστ\mu_{\sigma\tau}, which is more informed than μσ\mu_{\sigma} and μSτ\mu_{S\tau}, which are each more informed than μST\mu_{ST}, which is more informed than 𝔼[Y]\mathbb{E}\left[Y\right].

To see that this is the natural order of the arguments, recall that Bregman divergences are motivated by the property that they elicit the mean (see Proposition 4.2): if an agent who gives an estimate of xx for the value of a random variable YY incurs a loss of DG(Yx)D_{G}(Y\parallel x), then the agent minimizes their expected loss by reporting x=𝔼[Y]x=\mathbb{E}\left[Y\right]. This means that the expert ought to report the expected value of YY given the information that the expert knows.

This means that given two estimates of YY, Z1Z_{1} and Z2Z_{2}, of which Z1Z_{1} is more informed, the quantity DG(Z1Z2)D_{G}(Z_{1}\parallel Z_{2}) has a natural interpretation: it is the expected amount the expert gains by learning more and refining their estimate from Z2Z_{2} to Z1Z_{1}. This follows by the Pythagorean theorem:

𝔼[DG(Z1Z2)]=𝔼[DG(YZ2)]𝔼[DG(YZ1)].\mathbb{E}\left[D_{G}(Z_{1}\parallel Z_{2})\right]=\mathbb{E}\left[D_{G}(Y\parallel Z_{2})\right]-\mathbb{E}\left[D_{G}(Y\parallel Z_{1})\right].
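
As a small illustration of the elicitation property, the following sketch takes G to be the negative Shannon entropy (so that D_G(Y∥x) is the log loss of the report x when Y ∈ {0,1}) and confirms by grid search that the expected divergence from Y ~ Bernoulli(0.3) is minimized by reporting the mean, x = 0.3.

```python
import numpy as np

def G(y):
    y = np.clip(y, 1e-12, 1 - 1e-12)            # negative Shannon entropy
    return y * np.log(y) + (1 - y) * np.log(1 - y)

def D(y, x):                                    # Bregman divergence D_G(y || x)
    return G(y) - G(x) - (y - x) * (np.log(x) - np.log(1 - x))

p = 0.3                                         # Y ~ Bernoulli(p)
reports = np.linspace(0.01, 0.99, 981)
loss = p * D(1.0, reports) + (1 - p) * D(0.0, reports)
print(reports[np.argmin(loss)])                 # ~0.30: the conditional mean minimizes the loss
```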

D.1 Alternative Definitions of Agreement

One important motivation for using the Jensen-Bregman divergence to the midpoint as the definition of agreement is that this quantity serves as a lower bound on the expected amount that Charlie disagrees with Alice and Bob. Formally:

Definition D.1.

Let aa, bb, and cc be Alice’s, Bob’s, and Charlie’s expectations, respectively (these are random variables on Ω\Omega). Alice and Bob ϵ\epsilon-agree with Charlie if 12(𝔼[DG(ac)+DG(bc)])ϵ\frac{1}{2}(\mathbb{E}\left[D_{G}(a\parallel c)+D_{G}(b\parallel c)\right])\leq\epsilon.

(This is the order of arguments because Alice and Bob are more informed than Charlie.) By Proposition C.1 (i), we know that if Alice and Bob ϵ\epsilon-agree with Charlie then they ϵ\epsilon-agree.

As it happens, the fact that agreement under this (stronger) definition implies accuracy under rectangle substitutes follows immediately:

Proposition D.2.

Let =(Ω,,𝒮,𝒯,Y)\mathcal{I}=(\Omega,\mathbb{P},\mathcal{S},\mathcal{T},Y) be an information structure that satisfies rectangle substitutes. For any communication protocol that causes Alice and Bob to ϵ\epsilon-agree with Charlie on \mathcal{I}, Alice and Bob are 2ϵ2\epsilon-accurate after the protocol terminates.

Proof.

Let SS be the set of possible signals of Alice at the end of the protocol which are consistent with the protocol transcript, and define TT likewise for Bob. Recall that Charlie’s expectation is μST\mu_{ST}. We have

𝔼[DG(μστμSτ)]𝔼[DG(μσTμST)]𝔼[DG(μσTμST)]+𝔼[DG(μSτμST)]2ϵ,\mathbb{E}\left[D_{G}(\mu_{\sigma\tau}\parallel\mu_{S\tau})\right]\leq\mathbb{E}\left[D_{G}(\mu_{\sigma T}\parallel\mu_{ST})\right]\leq\mathbb{E}\left[D_{G}(\mu_{\sigma T}\parallel\mu_{ST})\right]+\mathbb{E}\left[D_{G}(\mu_{S\tau}\parallel\mu_{ST})\right]\leq 2\epsilon,

where the first inequality follows by rectangle substitutes and the last inequality follows because Alice and Bob ϵ\epsilon-agree with Charlie. ∎

The drawback of Definition D.1 is that it is not so much a definition of Alice and Bob’s agreement with each other as a definition of agreement with respect to the protocol being run (since Charlie only exists within the context of the protocol). Put otherwise, it is impossible to determine whether Alice and Bob ϵ\epsilon-agree with Charlie simply by knowing Alice’s and Bob’s expectations; one must also know Charlie’s expectation, which cannot be determined from theirs. The question “how far from agreement are Alice and Bob if Alice believes 25% and Bob believes 30%?” makes sense in the context of ϵ\epsilon-agreement, but not in the context of ϵ\epsilon-agreement with Charlie.

A different notion of agreement, which (like ϵ\epsilon-agreement) only depends on Alice’s and Bob’s expectations, uses the symmetrized Bregman divergence between these expectations: 12(DG(ab)+DG(ba))\frac{1}{2}(D_{G}(a\parallel b)+D_{G}(b\parallel a)).

Definition D.3.

Let aa and bb be Alice’s and Bob’s expectations, respectively (these are random variables on Ω\Omega). Alice and Bob satisfy symmetrized ϵ\epsilon-agreement if 12𝔼[DG(ab)+DG(ba)]ϵ\frac{1}{2}\mathbb{E}\left[D_{G}(a\parallel b)+D_{G}(b\parallel a)\right]\leq\epsilon.

By Proposition C.1 (iii), we know that if Alice and Bob satisfy symmetrized ϵ\epsilon-agreement then they ϵ\epsilon-agree.

In our context, symmetrized Bregman divergence is less natural than Jensen-Bregman divergence. This is because symmetrized Bregman divergence (unlike Jensen-Bregman divergence) does not seem to relate closely to our earlier discussion of the Bregman divergence from a more informed to a less informed estimate being the most natural quantity.
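As a quick numerical sanity check of the relations referenced above (both ϵ\epsilon-agreement with Charlie and symmetrized ϵ\epsilon-agreement imply ϵ\epsilon-agreement), here is a small Python sketch—our own illustration with hypothetical variable names—comparing the Jensen-Bregman divergence to the midpoint against both quantities, using the binary-KL Bregman divergence on random expectations:

import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    # Bregman divergence for G(x) = x log x + (1-x) log(1-x): binary KL divergence
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

a = rng.uniform(0.01, 0.99, size=100_000)   # Alice's expectation
b = rng.uniform(0.01, 0.99, size=100_000)   # Bob's expectation
c = rng.uniform(0.01, 0.99, size=100_000)   # Charlie's expectation (arbitrary)
m = (a + b) / 2                             # the midpoint

jb = 0.5 * (kl(a, m) + kl(b, m))            # Jensen-Bregman divergence to the midpoint
with_charlie = 0.5 * (kl(a, c) + kl(b, c))  # quantity in Definition D.1
symmetrized = 0.5 * (kl(a, b) + kl(b, a))   # quantity in Definition D.3

print(bool(np.all(jb <= with_charlie + 1e-9)))  # expected: True
print(bool(np.all(jb <= symmetrized + 1e-9)))   # expected: True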

D.2 Alternative Notions of Accuracy

Our definition of Alice’s accuracy as the expected Bregman divergence from the truth μστ\mu_{\sigma\tau} to Alice’s expectation seems like the most natural one. However, one may desire a definition of accuracy that takes both Alice’s and Bob’s expectations into account, judging the pair’s accuracy based on their consensus belief, rather than each of their individual beliefs. For instance, one could say that Alice and Bob are ϵ\epsilon-midpoint-accurate if 𝔼[DG(μστa+b2)]ϵ\mathbb{E}\left[D_{G}\left(\mu_{\sigma\tau}\parallel\frac{a+b}{2}\right)\right]\leq\epsilon. By this definition, Alice’s and Bob’s expectations could individually be far from the truth, but they are considered accurate because the average of their expectations is close to correct.

Proposition D.4.

If Alice and Bob are ϵ\epsilon-accurate, then they are 2ϵ2\epsilon-midpoint-accurate.

Proof.

Observe that for all a,b,ya,b,y it is the case that

DG(ya+b2)max(DG(ya),DG(yb))DG(ya)+DG(yb).D_{G}\left(y\parallel\frac{a+b}{2}\right)\leq\max(D_{G}(y\parallel a),D_{G}(y\parallel b))\leq D_{G}(y\parallel a)+D_{G}(y\parallel b).

The first inequality is true simply because a+b2\frac{a+b}{2} lies in between aa and bb: as a function of its second argument, DG(y)D_{G}(y\parallel\cdot) is decreasing below yy and increasing above yy, so it is maximized at an endpoint of the interval between aa and bb. Applying this with y=μστy=\mu_{\sigma\tau} and using that Alice and Bob are ϵ\epsilon-accurate,

𝔼[DG(μστa+b2)]𝔼[DG(μστa)+DG(μστb)]2ϵ.\mathbb{E}\left[D_{G}\left(\mu_{\sigma\tau}\parallel\frac{a+b}{2}\right)\right]\leq\mathbb{E}\left[D_{G}(\mu_{\sigma\tau}\parallel a)+D_{G}(\mu_{\sigma\tau}\parallel b)\right]\leq 2\epsilon. ∎
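The pointwise inequality used in the first step is also easy to verify numerically; here is a small Python sketch (our own illustration) checking it for the squared-distance and binary-KL Bregman divergences on random triples:

import numpy as np

rng = np.random.default_rng(0)

def sq(y, x):                       # Bregman divergence for G(x) = x^2
    return (y - x) ** 2

def kl(y, x):                       # Bregman divergence for the binary entropy G
    y = np.clip(y, 1e-9, 1 - 1e-9)  # guard the y*log(y) terms when y is 0 or 1
    return y * np.log(y / x) + (1 - y) * np.log((1 - y) / (1 - x))

y = rng.uniform(0, 1, size=100_000)
a = rng.uniform(0.01, 0.99, size=100_000)
b = rng.uniform(0.01, 0.99, size=100_000)
mid = (a + b) / 2

for D in (sq, kl):
    lhs = D(y, mid)                         # divergence to the midpoint
    rhs = np.maximum(D(y, a), D(y, b))      # max of divergences to the endpoints
    print(D.__name__, bool(np.all(lhs <= rhs + 1e-9)))   # expected: True for both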

Another natural choice for Alice’s and Bob’s consensus belief is the QA pool (see [NR21a]). Proposition D.4 likewise holds for the QA pool in place of the midpoint, and indeed holds for any choice of consensus belief that is guaranteed to lie in between Alice’s and Bob’s expectations. Thus, any such definition will be weaker than our definition of ϵ\epsilon-accuracy for Alice and Bob (up to a constant factor).

To summarize, among the above definitions of agreement, ϵ\epsilon-agreement is the weakest; and among the above definitions of accuracy, Alice’s and Bob’s ϵ\epsilon-accuracy is the strongest. This is an indication of strength for Theorem 4.18: it starts from a relatively weak premise and reaches a relatively strong conclusion.

Appendix E Implications for Communication Complexity

Our results can be framed in a communication complexity context, where they imply that “substitutable” functions can be computed with probability 1δ1-\delta (over the inputs) with a transcript length depending only on δ\delta. This is a nonstandard and weak notion of computing the function, but sketching the reduction may inspire future work on connections between substitutes and communication complexity.

In a classic deterministic communication complexity setup (e.g. [RY20]), Alice holds σ𝒮\sigma\in\mathcal{S}, Bob holds τ𝒯\tau\in\mathcal{T}, and the goal is to compute some function g:𝒮×𝒯{0,1}g:\mathcal{S}\times\mathcal{T}\to\{0,1\} using a communication protocol (see Section 2.2). Our setting captures this model when Y=g(σ,τ)Y=g(\sigma,\tau). Observe that in this case, Y=μστY=\mu_{\sigma\tau}, i.e. Alice and Bob’s information together determine YY completely. A communication protocol defines its output by a function h:Π{0,1}h:\Pi\to\{0,1\} where Π\Pi is the space of transcripts. We can simply let h(π)=round(μST)h(\pi)=\text{round}(\mu_{ST}), i.e. rounding the ex post expectation 𝔼[Yπ]=μST\mathbb{E}\left[Y\mid\pi\right]=\mu_{ST} to either zero or one. Here μST\mu_{ST} is the belief of “Charlie”, i.e. the estimate based on the common knowledge of Alice and Bob after the protocol is completed.

Definition E.1 (Rectangle substitutes, (1δ)(1-\delta)-computes).

Given a function gg and a distribution 𝒟\mathcal{D} over 𝒮×𝒯\mathcal{S}\times\mathcal{T}, we say (g,𝒟)(g,\mathcal{D}) satisfy rectangle substitutes if the corresponding information structure with Y=g(σ,τ)Y=g(\sigma,\tau) satisfies rectangle substitutes (Definition 2.6). We say a protocol (1δ)(1-\delta)-computes gg over 𝒟\mathcal{D} if, with probability at least 1δ1-\delta over (σ,τ)𝒟(\sigma,\tau)\sim\mathcal{D}, the protocol has h(π)=g(σ,τ)h(\pi)=g(\sigma,\tau).

By our results, under rectangle substitutes (g,𝒟)(g,\mathcal{D}), any agreement protocol approximately computes gg over 𝒟\mathcal{D}. More precisely, using a fast substitutes-agreement protocol similar to Proposition B.1, we obtain the following.

Corollary E.2.

Suppose (g,𝒟)(g,\mathcal{D}) satisfy rectangle substitutes. Then for every δ(0,1)\delta\in(0,1), there is a deterministic communication protocol using O(log(1/δ))O(\log(1/\delta)) bits of communication that (1δ)(1-\delta)-computes gg over 𝒟\mathcal{D}.

Proof.

In round one, Alice sends her current expectation μσ\mu_{\sigma} rounded to a multiple of ϵ\epsilon; call this message AA. In round two, Bob sends his updated expectation μSτ\mu_{S\tau} rounded to a multiple of ϵ\epsilon; call this message BB. The protocol then halts, and the output is BB rounded to either zero or one. It uses O(log(1/ϵ))O(\log(1/\epsilon)) bits. Let S,TS,T be the random rectangle associated with the protocol.

By construction, |μσA|ϵ|\mu_{\sigma}-A|\leq\epsilon, and μS\mu_{S} is the expectation of YY conditioned on AA, i.e. an average of the values μσ\mu_{\sigma^{\prime}} over signals σ\sigma^{\prime} that produce the same message AA; since all such values round to the same multiple of ϵ\epsilon and hence lie within ϵ\epsilon of one another, it follows that |μσμS|ϵ|\mu_{\sigma}-\mu_{S}|\leq\epsilon. Using substitutes (just as in Proposition B.1),

𝔼[(μστμSτ)2]𝔼[(μσμS)2]ϵ2.\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S\tau})^{2}\right]\leq\mathbb{E}\left[(\mu_{\sigma}-\mu_{S})^{2}\right]\leq\epsilon^{2}.

By construction, |BμSτ|ϵ|B-\mu_{S\tau}|\leq\epsilon. Therefore, by the 12\tfrac{1}{2}-approximate triangle inequality for squared distance (e.g. Proposition 4.13),

𝔼[(μστB)2]2𝔼[(μστμSτ)2]+2𝔼[(μSτB)2]4ϵ2.\mathbb{E}\left[(\mu_{\sigma\tau}-B)^{2}\right]~{}\leq~{}2\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S\tau})^{2}\right]+2\mathbb{E}\left[(\mu_{S\tau}-B)^{2}\right]~{}\leq~{}4\epsilon^{2}.

Now, the protocol is incorrect only if |Bμστ|12|B-\mu_{\sigma\tau}|\geq\tfrac{1}{2}. Using Markov’s inequality,

Pr[|Bμστ|12]\displaystyle\Pr[|B-\mu_{\sigma\tau}|\geq\tfrac{1}{2}] =Pr[(Bμστ)214]\displaystyle=\Pr[(B-\mu_{\sigma\tau})^{2}\geq\tfrac{1}{4}]
4𝔼[(Bμστ)2]\displaystyle\leq 4\mathbb{E}\left[(B-\mu_{\sigma\tau})^{2}\right]
16ϵ2.\displaystyle\leq 16\epsilon^{2}.

Therefore, given δ(0,1)\delta\in(0,1), we run the protocol with ϵ=δ/16\epsilon=\sqrt{\delta/16}. The probability of an incorrect output is at most δ\delta, and we use O(log(1/ϵ))=O(log(1/δ))O(\log(1/\epsilon))=O(\log(1/\delta)) bits of communication. ∎
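To make the two-round protocol above concrete, the following Python sketch simulates it end to end on a toy threshold function. This is our own example; we have not verified that it satisfies rectangle substitutes, so it only illustrates the mechanics of the protocol, not the guarantee of Corollary E.2.

import numpy as np

# Toy inputs: sigma and tau are independent and uniform on {0,...,99}, and the
# target is the threshold function g(sigma, tau) = 1 if sigma + tau >= 100.
n = 100
g = (np.arange(n)[:, None] + np.arange(n)[None, :] >= n).astype(float)

eps = 0.1
mu_sigma = g.mean(axis=1)            # Alice's expectation E[Y | sigma]
A = np.round(mu_sigma / eps) * eps   # round one: Alice's message (nearest multiple of eps)

errors = 0
for s in range(n):
    S = np.isclose(A, A[s])          # Alice-signals consistent with the message A
    for t in range(n):
        mu_S_tau = g[S, t].mean()            # Bob's updated expectation E[Y | S, tau]
        B = np.round(mu_S_tau / eps) * eps   # round two: Bob's message
        output = 1.0 if B >= 0.5 else 0.0    # output: B rounded to {0, 1}
        errors += output != g[s, t]

print("error probability over uniform inputs:", errors / n**2)

Decreasing eps shrinks the rectangles, and with them the error probability, at the cost of more bits—mirroring the choice of ϵ\epsilon as a function of δ\delta in the proof above.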