
Agreement Implies Accuracy for Substitutable Signals

Rafael Frongillo, Eric Neyman, Bo Waggoner

Inspired by Aumann’s agreement theorem, [Aar05] studied the amount of communication necessary for two Bayesian experts to approximately agree on the expectation of a random variable. Aaronson showed that, remarkably, the number of bits does not depend on the amount of information available to each expert. However, in general the agreed-upon estimate may be inaccurate: far from the estimate they would settle on if they were to share all of their information. We show that if the experts’ signals are substitutes—meaning the experts’ information has diminishing marginal returns—then whenever the experts are close to agreement, they are close to the truth. We prove this result for a broad class of agreement and accuracy measures that includes squared distance and KL divergence. Additionally, we show that although these measures capture fundamentally different kinds of agreement, Aaronson’s agreement result generalizes to them as well.

1 Introduction

Suppose that Alice and Bob are honest, rational Bayesians who wish to estimate some quantity—say, the unemployment rate one year from now. Alice is an expert on historical macroeconomic trends, while Bob is an expert on contemporary monetary policy. They convene to discuss and share their knowledge with each other until they reach an agreement about the expected value of the future unemployment rate. Alice and Bob could reach agreement by sharing everything they had ever learned, at which point they would have the same information, but the process would take years. How then should they proceed?

In the seminal work “Agreeing to Disagree,” Aumann [Aum76] observed that Alice and Bob can reach agreement simply by taking turns sharing their current expected value for the quantity. In addition to modeling communication between Bayesian agents, protocols similar to this one model financial markets: each trader shares partial information about their expected value on their turn (discussed in Section 5). A remarkable result by Scott Aaronson [Aar05] shows that if Alice and Bob follow certain protocols of this form, they will agree to within $\epsilon$ with probability $1-\delta$ by communicating $O\left(\frac{1}{\delta\epsilon^{2}}\right)$ bits. (To ensure that each message is short, Alice and Bob share discretized versions of their estimates; we discuss this in Section 2.) Notably, this bound only depends on the error Alice and Bob are willing to tolerate, and not on the amount of information available to them.

Absent from Aaronson’s results, however, is what Alice and Bob agree on. In particular, there is no guarantee that Alice and Bob will be accurate, meaning their agreed-upon estimate will be close (in e.g. expected squared distance) to what they would believe if they shared all of their information. In fact, they might agree on something highly inaccurate: suppose that Alice and Bob have independent, uniformly random bits $b_{A},b_{B}$, and wish to estimate the XOR $b_{A}\oplus b_{B}$. Alice and Bob agree from the onset, as from each of their perspectives, the expected value of $b_{A}\oplus b_{B}$ is $\frac{1}{2}$. Yet this expectation is far from the best estimate given their collective knowledge, which is either $0$ or $1$. So while agreement is fundamental to understanding communication between Bayesians—in Aumann’s terms, they cannot “agree to disagree”—agreement is far from the whole story. An important open problem is therefore what assumptions guarantee that Alice and Bob are accurate once they agree.
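
The XOR example can be checked mechanically. The following sketch (in Python, with variable names of our own choosing) enumerates the four equally likely signal pairs and confirms that both experts’ conditional expectations are always $\frac{1}{2}$ while the estimate given both signals is $0$ or $1$.

```python
from itertools import product

# A sketch of the XOR example: states are pairs of independent uniform bits (b_A, b_B)
# and Y is their XOR. Both experts' conditional expectations are always 1/2, while the
# estimate given both signals is 0 or 1.
states = list(product([0, 1], repeat=2))             # each state has probability 1/4
Y = {(bA, bB): bA ^ bB for (bA, bB) in states}

def cond_exp(known):
    """E[Y | the signals listed in `known`], where known maps index -> observed bit."""
    consistent = [s for s in states if all(s[i] == v for i, v in known.items())]
    return sum(Y[s] for s in consistent) / len(consistent)

for bA, bB in states:
    alice = cond_exp({0: bA})           # E[Y | b_A]       = 1/2
    bob = cond_exp({1: bB})             # E[Y | b_B]       = 1/2
    truth = cond_exp({0: bA, 1: bB})    # E[Y | b_A, b_B]  = 0 or 1
    print(bA, bB, alice, bob, truth)
```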

We address this open problem by introducing a natural condition, called rectangle substitutes, under which agreement implies accuracy. Rectangle substitutes is a notion of informational substitutes: the property that additional information has diminishing marginal returns. The notion of substitutes is ubiquitous in optimization problems, and informational substitutes conditions have recently been used to analyze equilibria in markets [CW16]. We show that under the rectangle substitutes condition, any protocol leading to agreement will also lead to accuracy. We then extend these results beyond the case of squared error, to a broad family of measures of agreement and accuracy including KL divergence.

1.1 Overview of approach and results

In [Aar05], Alice and Bob are said to agree if the squared distance between their estimates is small. Likewise, we can say that Alice and Bob are accurate if the squared distance between each of their estimates and the truth is small. In Section 3 we present our first main result: under these definitions, if the information structure satisfies rectangle substitutes, then agreement implies accuracy. In other words, under this assumption, when two Bayesians agree—regardless of how little information they have shared—they necessarily agree on the truth.

The proof involves carefully partitioning the space of posterior beliefs induced by the protocol. Agreement is used to show that Alice and Bob usually fall into the same partition element, which means that Bob would not learn much from learning the partition element of Alice’s expectation. Then, the rectangle substitutes condition is used to show that if Bob were to learn Alice’s partition element, then he would be very close to knowing the truth.

Aaronson measures agreement in terms of squared error, yet other measurements like KL divergence may be better suited for some settings. For example, if Alice and Bob estimate the probability of a catastrophic event as $10^{-10}$ and $10^{-2}$, respectively, then under squared error they are said to agree closely, but arguably they disagree strongly, as reflected by their large KL divergence. Motivated by these different ways to measure agreement, we next ask:

  1. Can Aaronson’s protocols be generalized to other notions of agreement, such that the number of bits communicated is independent of the amount of information available to Alice and Bob?

  2. Do other notions of agreement necessarily imply accuracy under rectangle substitutes?

In Section 4, we give our second and third main results: the answer to both questions is yes. Specifically, the positive results apply when measuring agreement and accuracy using Bregman divergences, a class of error measures that includes both squared distance and KL divergence. (The third result holds under an “approximate triangle inequality” condition on the Bregman divergence, which is satisfied by most or all natural choices; indeed, it is nontrivial to construct a Bregman divergence that does not satisfy this property.)

Aaronson’s proof of his agreement theorem turns out to be specific to squared distance. Our agreement theorem (Theorem 4.11) modifies Aaronson’s protocol to depend on the particular Bregman divergence, i.e. the relevant error measure. It then proceeds in a manner inspired by Aaronson but using several new ideas. Our proof that agreement implies accuracy under rectangle substitutes for general Bregman divergences also involves some nontrivial changes to our proof for squared distance. In particular, the fact that the length of an interval cannot be inferred from the Bregman divergence between its endpoints necessitates a closer analysis of the partition of Alice’s and Bob’s beliefs.

We conclude in Section 5 with a discussion of connections between agreement protocols and information revelation in financial markets, and discuss an interesting potential avenue for future work.

1.2 Related Work

Our setting is related to but distinct from communication complexity. In that field (e.g. [RY20]), the goal is for Alice and Bob to correctly compute a function of their inputs while communicating as few bits as possible and using any protocol necessary. By contrast, [Aar05] considered a goal of agreement, not correctness, and focused on specific natural protocols, which he showed achieve this goal in a constant number of bits. Our work focuses on Aaronson’s setting. We discuss how our results might be framed in terms of communication complexity in Appendix E.

Our introduction of the substitutes condition is inspired by its usefulness in prediction markets [CW16]. The “expectation-sharing” agreement protocols we study bear a strong similarity to dynamics of market prices. [Ost12] introduced a condition under which convergence of prices in a market implies that all information is aggregated. This can be viewed as an “agreement implies accuracy” condition. We discuss these works and the connection of our work to markets in Section 5. Another similar definition of informational substitutes is used by [NR21] in the context of robust aggregation of forecasts.

Finally, we note that the “agreement protocols” we study are not related to key agreement protocols in cryptography, where the goal is for two communicating parties to jointly construct a shared string for cryptographic use.

2 Preliminaries

2.1 Information Structures

We consider a set $\Omega$ of states of the world, with a probability distribution $\mathbb{P}$ over the world states. There are two experts, Alice and Bob. Alice learns the value of a random variable $\sigma:\Omega\to\mathcal{S}$; we call $\sigma$ Alice’s signal and $\mathcal{S}$ her signal set. Correspondingly, Bob learns the value of a random variable $\tau:\Omega\to\mathcal{T}$. These signals each convey partial information about the true state $\omega\in\Omega$. Alice and Bob are interested in a third random variable $Y:\Omega\to[0,1]$. We use the term information structure to refer to the tuple $\mathcal{I}:=(\Omega,\mathbb{P},\mathcal{S},\mathcal{T},Y)$.

We denote by $\mu_{\sigma\tau}:=\mathbb{E}\left[Y\mid\sigma,\tau\right]$ the random variable that is equal to the expected value of $Y$ conditioned on both Alice’s signal $\sigma$ and Bob’s signal $\tau$. We also define $\mu_{\sigma}:=\mathbb{E}\left[Y\mid\sigma\right]$ and $\mu_{\tau}:=\mathbb{E}\left[Y\mid\tau\right]$. For a measurable set $S\subseteq\mathcal{S}$, we define $\mu_{S}:=\mathbb{E}\left[Y\mid\sigma\in S\right]$; we define $\mu_{T}$ analogously for $T\subseteq\mathcal{T}$. Additionally, for $T\subseteq\mathcal{T}$, we define $\mu_{\sigma T}:=\mathbb{E}\left[Y\mid\tau\in T,\sigma\right]$, i.e. the expected value of $Y$ conditioned on the particular value of $\sigma$ and the knowledge that $\tau\in T$. If Alice knows that Bob’s signal belongs to $T$ (and nothing else about his signal), then the expected value of $Y$ conditional on her information is $\mu_{\sigma T}$; we refer to this as Alice’s expectation. Likewise, for $S\subseteq\mathcal{S}$, we define $\mu_{S\tau}:=\mathbb{E}\left[Y\mid\sigma\in S,\tau\right]$. Finally, we define $\mu_{ST}:=\mathbb{E}\left[Y\mid\sigma\in S,\tau\in T\right]$. This is the expectation of a third party who only knows that $\sigma\in S$ and $\tau\in T$.

In general we often wish to take expectations conditioned on $\sigma\in S,\tau\in T$ (for some $S\subseteq\mathcal{S},T\subseteq\mathcal{T}$). We will use the shorthand $\mathbb{E}\left[\cdot\mid S,T\right]$ for $\mathbb{E}\left[\cdot\mid\sigma\in S,\tau\in T\right]$ in such cases.
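
As an illustration of these conditional expectations, the following sketch computes quantities of the form $\mu_{ST}$ on a small hypothetical finite information structure; the tables `P` and `Y` and the helper `mu` are our own, and $\mu_{\sigma}$, $\mu_{\tau}$, and $\mu_{\sigma\tau}$ arise as special cases by taking singleton or full sets of signals.

```python
# A small illustration on a hypothetical finite information structure: P maps a signal
# pair (sigma, tau) to its probability, and Y maps it to E[Y | sigma, tau] on that pair.
P = {("s1", "t1"): 0.25, ("s1", "t2"): 0.25, ("s2", "t1"): 0.25, ("s2", "t2"): 0.25}
Y = {("s1", "t1"): 0.1, ("s1", "t2"): 0.5, ("s2", "t1"): 0.6, ("s2", "t2"): 0.9}

def mu(S, T):
    """mu_{ST} = E[Y | sigma in S, tau in T]."""
    mass = sum(p for (s, t), p in P.items() if s in S and t in T)
    return sum(p * Y[s, t] for (s, t), p in P.items() if s in S and t in T) / mass

print(mu({"s1"}, {"t1", "t2"}))    # mu_sigma       for sigma = "s1"       -> 0.3
print(mu({"s1", "s2"}, {"t1"}))    # mu_tau         for tau   = "t1"       -> 0.35
print(mu({"s1"}, {"t1"}))          # mu_{sigma tau} for the pair (s1, t1)  -> 0.1
```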

2.2 Agreement Protocols

The notion of agreement between Alice and Bob is central to our work. We first define agreement in terms of squared error, and generalize to other error measures in Section 4.

Definition 2.1 ($\epsilon$-agree).

Let $a$ and $b$ be Alice’s and Bob’s expectations, respectively ($a$ and $b$ are random variables on $\Omega$). Alice and Bob $\epsilon$-agree if $\frac{1}{4}\mathbb{E}\left[(a-b)^{2}\right]\leq\epsilon$.

The constant $\frac{1}{4}$ makes the left-hand side equal the expected squared distance from each of Alice’s and Bob’s expectations to the average of the two.

Our setting follows [Aar05], which examined communication protocols that cause Alice and Bob to agree. In a (deterministic) communication protocol, Alice and Bob take turns sending each other messages. On Alice’s turns, Alice communicates a message that is a deterministic function of her input (i.e. her signal $\sigma$) and all previous communication, and likewise for Bob on his turns. A rectangle is a set of the form $S\times T$ where $S\subseteq\mathcal{S}$ and $T\subseteq\mathcal{T}$. The transcript of the protocol at a time step $t$ (i.e. after $t$ messages have been sent) partitions $\Omega$ into rectangles: for any given sequence of $t$ messages, there are subsets $S_{t}\subseteq\mathcal{S},T_{t}\subseteq\mathcal{T}$ such that the protocol transcript at time $t$ is equal to this sequence if and only if $(\sigma,\tau)\in S_{t}\times T_{t}$. For a given communication protocol, we may think of $S_{t}$ and $T_{t}$ as random variables. Alice’s expectation at time $t$ (i.e. after the $t$-th message has been sent) is $\mu_{\sigma T_{t}}$ and Bob’s expectation at time $t$ is $\mu_{S_{t}\tau}$. Finally, the protocol terminates at a certain time (which need not be known in advance of the protocol). While typically in communication complexity a protocol is associated with a final output, in this case we are interested in Alice’s and Bob’s expectations, so we do not require an output.

It will be convenient to hypothesize a third party observer, whom we call Charlie, who observes the protocol but has no other information. At time $t$, Charlie has expectation $\mu_{S_{t}T_{t}}$. Charlie’s expectation can also be interpreted as the expectation of $Y$ according to Alice and Bob’s common knowledge.

The following definition formalizes the relationship between communication protocols and agreement.

Definition 2.2 ($\epsilon$-agreement protocol).

Given an information structure $\mathcal{I}$, a communication protocol causes Alice and Bob to $\epsilon$-agree on $\mathcal{I}$ if Alice and Bob $\epsilon$-agree at the end of the protocol, i.e., if $\frac{1}{4}\mathbb{E}\left[(a-b)^{2}\right]\leq\epsilon$, where the expected value is over Alice’s and Bob’s inputs. We say that a communication protocol is an $\epsilon$-agreement protocol if the protocol causes Alice and Bob to $\epsilon$-agree on every information structure.

Aaronson defines and analyzes two $\epsilon$-agreement protocols. (A minor difference to our framing is that [Aar05] focuses on probable approximate agreement: protocols that cause the absolute difference between Alice’s and Bob’s expectations to be at most $\epsilon$ with probability at least $1-\delta$. While the results as presented in this section are stronger than those in [Aar05]—the original results follow from these as a consequence of Markov’s inequality—they follow from a straightforward modification of his proofs.) The first of these is the standard protocol, in which Alice and Bob take turns stating their expectations for a number of time steps that can be computed by Alice and Bob independently in advance of the protocol, and which is guaranteed to be at most $O(1/\epsilon)$.

The fact that exchanging their expectations for $O(1/\epsilon)$ time steps results in $\epsilon$-agreement is profound and compelling. However, the standard protocol may require an unbounded number of bits of communication, since Alice and Bob are exchanging real numbers. To address this, Aaronson defines another agreement protocol that is truly polynomial-communication (which we slightly modify for our purposes):

Definition 2.3 (Discretized protocol, [Aar05]).

Choose $\epsilon>0$. In the discretized protocol with parameter $\epsilon$, on her turn (at time $t$), Alice sends “low” if her expectation is smaller than Charlie’s by more than $\epsilon/4$, i.e. if $\mu_{\sigma T_{t-1}}<\mu_{S_{t-1}T_{t-1}}-\epsilon/4$; “high” if her expectation is larger than Charlie’s by more than $\epsilon/4$; and “medium” otherwise. Bob acts analogously on his turn. At the start of the protocol, Alice and Bob use the information structure to independently compute the time $t_{\text{end}}\leq\frac{1000}{\epsilon}$ that minimizes $\mathbb{E}\left[(\mu_{\sigma T_{t_{\text{end}}}}-\mu_{S_{t_{\text{end}}}\tau})^{2}\right]$. The protocol ends at this time.
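
The message rule in Definition 2.3 can be phrased as a small function; the sketch below uses hypothetical argument names, with `alice_expectation` standing in for $\mu_{\sigma T_{t-1}}$ and `charlie_expectation` for $\mu_{S_{t-1}T_{t-1}}$, both of which would be computed from the information structure.

```python
# A sketch of Alice's message rule in the discretized protocol (Definition 2.3).
def alice_message(alice_expectation: float, charlie_expectation: float, eps: float) -> str:
    if alice_expectation < charlie_expectation - eps / 4:
        return "low"
    if alice_expectation > charlie_expectation + eps / 4:
        return "high"
    return "medium"
```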

Theorem 2.4 ([Aar05, Theorem 4]).

The discretized protocol with parameter $\epsilon$ is an $\epsilon$-agreement protocol with transcript length $O(1/\epsilon)$ bits.

In general, we refer to Aaronson’s standard and discretized protocols as examples of expectation-sharing protocols. We will define other examples in Section 4, similar to Aaronson’s discretized protocol but with different cutoffs for low, medium, and high. We also interpret expectation-sharing protocols in the context of markets in Section 5.

2.3 Accuracy and Informational Substitutes

Most of our main results give conditions such that if Alice and Bob $\epsilon$-agree, then Alice’s and Bob’s estimates are accurate. By accurate, we mean that Alice’s and Bob’s expectations are close to $\mu_{\sigma\tau}$, i.e., what they would believe if they knew each other’s signals. (After all, they cannot hope to have a better estimate of $Y$ than $\mu_{\sigma\tau}$; for this reason we sometimes refer to $\mu_{\sigma\tau}$ as the “truth.”) Formally:

Definition 2.5 ($\epsilon$-accurate).

Let $a$ be Alice’s expectation. Alice is $\epsilon$-accurate if $\mathbb{E}\left[(\mu_{\sigma\tau}-a)^{2}\right]\leq\epsilon$. We define $\epsilon$-accuracy analogously for Bob.

One cannot hope for an unconditional result stating that if Alice and Bob agree, then they are accurate. Consider for instance the XOR information structure from the introduction: Alice and Bob each receive independent random bits as input, and $Y$ is the XOR of these bits. Then from the start Alice and Bob agree that the expected value of $Y$ is exactly $\frac{1}{2}$, but this value is far from $\mu_{\sigma\tau}$, which is either $0$ or $1$.

Intuitively, this situation arises because Alice’s and Bob’s signals are informational complements: each signal is not informative by itself, but they are informative when taken together. On the other hand, we say that signals are informational substitutes if learning one signal is less valuable if you already know the other signal. An extreme example is if $\sigma=\tau=X$ for any random variable $X$. Here $\sigma$ becomes useless upon learning $\tau$ and vice versa. In [CW16] (we recommend the arXiv version for the most up-to-date introduction to informational substitutes), the authors discuss formalizations of several notions of informational substitutes. All of these notions capture “diminishing marginal value,” in the sense that, roughly speaking, the value of partial information is a submodular set function. The various definitions proposed by [CW16] only differ in how finely they allow decomposing $\sigma$ and $\tau$ to obtain a marginal unit. Our definition has the same format, but uses a decomposition inspired by information rectangles in communication complexity. Recall that we write $\mid S,T$ as shorthand for $\mid\sigma\in S,\tau\in T$.

Definition 2.6.

An information structure $\mathcal{I}=(\Omega,\mathbb{P},\mathcal{S},\mathcal{T},Y)$ satisfies rectangle substitutes if it satisfies weak substitutes on every sub-rectangle, i.e., if for every $S\subseteq\mathcal{S},T\subseteq\mathcal{T}$, we have

\[
\mathbb{E}\left[(Y-\mu_{S\tau})^{2}\mid S,T\right]-\mathbb{E}\left[(Y-\mu_{\sigma\tau})^{2}\mid S,T\right]\leq\mathbb{E}\left[(Y-\mu_{ST})^{2}\mid S,T\right]-\mathbb{E}\left[(Y-\mu_{\sigma T})^{2}\mid S,T\right].
\]

We will show that under rectangle substitutes, if Alice and Bob approximately agree, then they are approximately accurate.

Interpreting substitutes.

Both sides of the inequality in Definition 2.6 represent the “value” of learning $\sigma$ as measured by a decrease in error. The left-hand side gives the decrease if one already knows $\tau$ and that $\sigma\in S$; the right-hand side gives the decrease if one only knows that $\sigma\in S,\tau\in T$. Substitutes thus says: the marginal value of learning $\sigma$ is smaller if one already knows $\tau$ than if one does not. This statement should hold for every sub-rectangle $S,T$. We remark that the inequality can be rearranged to focus instead on the marginal value of $\tau$ rather than $\sigma$. We also note that in the XOR information structure, the left-hand side of the inequality is $\frac{1}{4}$ while the right-hand side is zero: a large violation of the substitutes condition. In the example $\sigma=\tau=X$, the left side is always zero.
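
To make the violation concrete, the following sketch checks the inequality of Definition 2.6 over all sub-rectangles of the XOR structure; the helper names (`P`, `Y`, `mu`, `err`, `subsets`) are our own, and the only violated rectangle is the full one, with left-hand side $\frac{1}{4}$ and right-hand side $0$ as noted above.

```python
from itertools import combinations, product

# Check the rectangle substitutes inequality on every sub-rectangle of the XOR example.
P = {(a, b): 0.25 for a, b in product([0, 1], repeat=2)}   # uniform signal pairs
Y = {(a, b): a ^ b for a, b in product([0, 1], repeat=2)}  # Y = XOR of the signals

def mu(S, T):
    """mu_{ST} = E[Y | sigma in S, tau in T]."""
    mass = sum(p for (s, t), p in P.items() if s in S and t in T)
    return sum(p * Y[s, t] for (s, t), p in P.items() if s in S and t in T) / mass

def err(estimate, S, T):
    """E[(Y - estimate(sigma, tau))^2 | sigma in S, tau in T]."""
    mass = sum(p for (s, t), p in P.items() if s in S and t in T)
    return sum(p * (Y[s, t] - estimate(s, t)) ** 2
               for (s, t), p in P.items() if s in S and t in T) / mass

def subsets(X):
    return [set(c) for r in range(1, len(X) + 1) for c in combinations(X, r)]

for S, T in product(subsets({0, 1}), subsets({0, 1})):
    lhs = err(lambda s, t: mu(S, {t}), S, T) - err(lambda s, t: mu({s}, {t}), S, T)
    rhs = err(lambda s, t: mu(S, T), S, T) - err(lambda s, t: mu({s}, T), S, T)
    if lhs > rhs + 1e-12:
        print(f"violated on S={S}, T={T}: lhs={lhs:.3f}, rhs={rhs:.3f}")
```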

[CW16] discusses three interpretations of substitutes, which motivate it as a natural condition. (1) Each side of the inequality measures an improvement in prediction error, here the squared loss, due to learning $\sigma$. Under substitutes, the improvement is smaller if one already knows $\tau$. (2) Each side measures a decrease in uncertainty due to learning $\sigma$. Under substitutes, $\sigma$ provides less information about $Y$ if one already knows $\tau$. (Here, uncertainty is measured by the variance of one’s belief. Under the KL divergence analogue covered in Section 4.1, uncertainty is measured in bits via Shannon entropy.) (3) Each side measures the decrease in distance of a posterior expectation from the truth when learning $\sigma$. The distance to $Y$ changes less if one already knows $\tau$.

2.4 The Pythagorean Theorem

We will use the following fact throughout. We defer the proof to Appendix C, where we establish a more general version of this statement.

Proposition 2.7 (Pythagorean theorem).

Let $A$ be a random variable, $B=\mathbb{E}\left[A\mid\mathcal{F}\right]$ where $\mathcal{F}$ is a sigma-algebra, and $C$ be a random variable defined on $\mathcal{F}$. Then

\[
\mathbb{E}\left[(A-C)^{2}\right]=\mathbb{E}\left[(A-B)^{2}\right]+\mathbb{E}\left[(B-C)^{2}\right].
\]

We use the phrase Pythagorean theorem in part because of its form, and in part because it is precisely the familiar Pythagorean theorem when the random variables are viewed as points in a Hilbert space with inner product $\langle X,Y\rangle:=\mathbb{E}\left[XY\right]$. (We do not make use of this abstraction in our work, but we refer the interested reader to [Šid57].)

Informally, $A$ is a random variable, $B$ is the expected value of $A$ conditional on some partial information, and $C$ is a random variable that only depends on this information. So the theorem applies when $B$ is a coarse estimate of $A$ and $C$ is at least as coarse as $B$, a scenario that often occurs in our setting.

One application of the Pythagorean theorem in our context takes $A=Y$, $B=\mu_{\sigma\tau}$ (the expected value of $Y$ conditioned on the experts’ signals), and $C=\mu_{\sigma T}$ (Alice’s expected value, which only depends on her signal and thus on the signal pair). This particular application, along with the symmetric one taking $C=\mu_{S\tau}$, allows us to rewrite the rectangle substitutes condition in a form that we will find more convenient:
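
The identity in Proposition 2.7 is also easy to verify numerically. In the following sketch the coarse information is a label in a two-element set, $B$ is estimated as the empirical conditional mean given the label, and $C$ is an arbitrary (hypothetical) function of the label chosen by us; the two sides then match up to floating-point error.

```python
import random

# A numerical sanity check of Proposition 2.7 on a small discrete example.
random.seed(0)
samples = [(random.choice(["x", "y"]), random.random()) for _ in range(20000)]

def mean(vals):
    return sum(vals) / len(vals)

B = {lab: mean([a for l, a in samples if l == lab]) for lab in ["x", "y"]}  # E[A | label]
C = {"x": 0.2, "y": 0.7}   # any function of the label works

lhs = mean([(a - C[l]) ** 2 for l, a in samples])
rhs = mean([(a - B[l]) ** 2 for l, a in samples]) + mean([(B[l] - C[l]) ** 2 for l, a in samples])
print(lhs, rhs)   # agree up to floating-point error, since B is the empirical conditional mean
```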

Remark 2.8.

An information structure $\mathcal{I}$ satisfies rectangle substitutes if and only if

\[
\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S\tau})^{2}\mid S,T\right]\leq\mathbb{E}\left[(\mu_{\sigma T}-\mu_{ST})^{2}\mid S,T\right] \tag{1}
\]

for all $S,T$.

3 Results for Squared Distance

Our main results show that, under the rectangle substitutes condition, any communication protocol that causes Alice and Bob to agree also causes them to be accurate. We now show the first of these results, which is specific to the squared distance error measure that we have been discussing.

3.1 Agreement Implies Accuracy

Theorem 3.1.

Let $\mathcal{I}=(\Omega,\mathbb{P},\mathcal{S},\mathcal{T},Y)$ be an information structure that satisfies rectangle substitutes. For any communication protocol that causes Alice and Bob to $\epsilon$-agree on $\mathcal{I}$, Alice and Bob are $10\epsilon^{1/3}$-accurate after the protocol terminates.

The crux of the argument is the following lemma.

Lemma 3.2.

Let $\mathcal{I}=(\Omega,\mathbb{P},\mathcal{S},\mathcal{T},Y)$ be an information structure that satisfies rectangle substitutes. Let $\epsilon=\mathbb{E}\left[(\mu_{\sigma}-\mu_{\tau})^{2}\right]$. Then

\[
\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{\tau})^{2}\right]\leq 6\epsilon^{1/3}.
\]

Let us first prove Theorem 3.1 assuming Lemma 3.2 is true.

Proof of Theorem 3.1.

Consider any protocol that causes Alice and Bob to $\epsilon$-agree on $\mathcal{I}$. Let $S$ be the set of possible signals of Alice at the end of the protocol which are consistent with the protocol transcript, and define $T$ likewise for Bob. Intuitively, $S\times T$ is the set of plausible signal pairs $(\sigma,\tau)$ according to an external observer of the protocol. Observe that $S$ and $T$ are random variables, each a function of both $\sigma$ and $\tau$. We have

\begin{align*}
\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S\tau})^{2}\right] &=\mathbb{E}_{S,T}\left[\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S\tau})^{2}\mid S,T\right]\right]\\
&\leq\mathbb{E}_{S,T}\left[6\left(\mathbb{E}\left[(\mu_{\sigma T}-\mu_{S\tau})^{2}\mid S,T\right]\right)^{1/3}\right]\\
&\leq 6\,\mathbb{E}_{S,T}\left[\mathbb{E}\left[(\mu_{\sigma T}-\mu_{S\tau})^{2}\mid S,T\right]\right]^{1/3}\\
&=6\,\mathbb{E}\left[(\mu_{\sigma T}-\mu_{S\tau})^{2}\right]^{1/3}=6(4\epsilon)^{1/3}\leq 10\epsilon^{1/3}.
\end{align*}

In the second step, we apply Lemma 3.2 to the information structure $\mathcal{I}$ restricted to $S\times T$—that is, to $\mathcal{I}^{\prime}=(\Omega^{\prime},\mathbb{P}^{\prime},S,T,Y)$, where $\Omega^{\prime}=\{\omega\in\Omega:\sigma\in S,\tau\in T\}$ and $\mathbb{P}^{\prime}[\omega]=\mathbb{P}\left[\omega\mid\sigma\in S,\tau\in T\right]$. (Note that we use the fact that if $\mathcal{I}$ satisfies rectangle substitutes, then so does $\mathcal{I}^{\prime}$; this is because a rectangle of $\mathcal{I}^{\prime}$ is also a rectangle of $\mathcal{I}$.) The third step follows by the concavity of $x^{1/3}$. Therefore, Bob is $10\epsilon^{1/3}$-accurate (and Alice is likewise by symmetry). ∎

The proof of Lemma 3.2 relies on the following claim. We defer the proof of Lemma 3.2 (and Claim 3.3) to Appendix B, and instead sketch the proofs here.

Claim 3.3.

For any $N\geq 1$, it is possible to partition $[0,1]$ into $N$ intervals $[0,x_{1}),[x_{1},x_{2}),\dots,[x_{N-1},1]$ in a way so that each interval has length at most $\frac{2}{N}$, and

\[
\mathbb{P}\left[k(\sigma)\neq k(\tau)\right]\leq\sqrt{\epsilon}\,N,
\]

where $k(\sigma)$ denotes the $k\in[N]$ such that $x_{k-1}\leq\mu_{\sigma}<x_{k}$, and $k(\tau)$ is defined analogously. (For convenience we define $x_{0}=0$ and $x_{N}$ to be some number greater than $1$.)

Intuitively, Claim 3.3 is true because if $\mathbb{E}\left[(\mu_{\sigma}-\mu_{\tau})^{2}\right]$ is small, then $\mu_{\sigma}$ and $\mu_{\tau}$ are likely to fall into the same interval.

We now sketch the proof of Lemma 3.2. To see why Claim 3.3 is relevant, recall that we wish to upper bound the expectation of $(\mu_{\sigma\tau}-\mu_{\tau})^{2}$. Let $S^{(k)}:=\{\sigma\in\mathcal{S}:x_{k-1}\leq\mu_{\sigma}<x_{k}\}$. By the Pythagorean theorem, we have

\[
\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{\tau})^{2}\right]=\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S^{(k(\sigma))}\tau})^{2}\right]+\mathbb{E}\left[(\mu_{S^{(k(\sigma))}\tau}-\mu_{\tau})^{2}\right].
\]

By using the rectangle substitutes condition for $S=S^{(k)},T=\mathcal{T}$ for every $k$, we find that

\[
\mathbb{E}\left[(\mu_{\sigma}-\mu_{S^{(k(\sigma))}})^{2}\right]\geq\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S^{(k(\sigma))}\tau})^{2}\right]. \tag{2}
\]

Therefore, we have

\[
\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{\tau})^{2}\right]\leq\mathbb{E}\left[(\mu_{\sigma}-\mu_{S^{(k(\sigma))}})^{2}\right]+\mathbb{E}\left[(\mu_{S^{(k(\sigma))}\tau}-\mu_{\tau})^{2}\right]. \tag{3}
\]

Claim 3.3 lets us argue that the first of these two terms is small (because $\mu_{\sigma}$ and $\mu_{S^{(k(\sigma))}}$ are always within $\frac{2}{N}$ of each other) and that the second term is also small (because conditioned on $\tau$, $k(\sigma)$ is known with high probability). We find that choosing $N=\epsilon^{-1/6}$ gives us the bound in Lemma 3.2.
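
Concretely, taking $N=\epsilon^{-1/6}$ (rounded to an integer), a rough accounting of the two terms in Equation 3 reads as follows; the constants are indicative only, and the careful version of this argument appears in Appendix B:

\[
\mathbb{E}\left[(\mu_{\sigma}-\mu_{S^{(k(\sigma))}})^{2}\right]\leq\left(\frac{2}{N}\right)^{2}=4\epsilon^{1/3},\qquad\mathbb{E}\left[(\mu_{S^{(k(\sigma))}\tau}-\mu_{\tau})^{2}\right]\lesssim\mathbb{P}\left[k(\sigma)\neq k(\tau)\right]\leq\sqrt{\epsilon}\,N=\epsilon^{1/3},
\]

so the total is $O(\epsilon^{1/3})$, consistent with the $6\epsilon^{1/3}$ bound of Lemma 3.2.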

Theorem 3.1 is a general result about agreement protocols. Applying the result to Aaronson’s discretized protocol gives us the following result.

Corollary 3.4.

Let $\mathcal{I}$ be any information structure that satisfies rectangle substitutes. For any $\epsilon>0$, Alice and Bob will be $\epsilon$-accurate after running Aaronson’s discretized protocol with parameter $\epsilon^{3}/1000$ (and this takes $O(1/\epsilon^{3})$ bits of communication).

Remark 3.5.

The discretized protocol is not always the most efficient agreement protocol. For example, Proposition B.1 shows that if the rectangle substitutes condition holds, agreement (and therefore accuracy) can be reached with just $O(\log(1/\epsilon))$ bits, an improvement on Corollary 3.4. We discuss communication complexity further in Appendix E. Even if more efficient protocols are sometimes possible, expectation-sharing protocols are of interest because they model naturally-occurring communication processes. For example, they capture the dynamics of prices in markets, which we also discuss in Section 5. More generally, we find it remarkable that Alice and Bob become accurate by running the agreement protocol (indeed any agreement protocol), despite such protocols being designed with only agreement in mind.

Finally, we observe the following important consequence of Theorem 3.1: once Alice and Bob agree, they continue to agree.

Corollary 3.6.

Let $\mathcal{I}=(\Omega,\mathbb{P},\mathcal{S},\mathcal{T},Y)$ be an information structure that satisfies rectangle substitutes. Consider a communication protocol with the property that Alice and Bob $\epsilon$-agree after round $t$. Then Alice and Bob $10\epsilon^{1/3}$-agree at all subsequent time steps.

Proof.

If Alice and Bob $\epsilon$-agree then they are $10\epsilon^{1/3}$-accurate, so in particular $\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{\sigma T_{t}})^{2}\right]\leq 10\epsilon^{1/3}$. Note that $\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{\sigma T_{s}})^{2}\right]$ is a decreasing function of $s$, since for any $s_{1}\leq s_{2}$ we have

\[
\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{\sigma T_{s_{1}}})^{2}\right]=\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{\sigma T_{s_{2}}})^{2}\right]+\mathbb{E}\left[(\mu_{\sigma T_{s_{2}}}-\mu_{\sigma T_{s_{1}}})^{2}\right]
\]

by the Pythagorean theorem. Therefore, for any $t^{\prime}>t$, we have $\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{\sigma T_{t^{\prime}}})^{2}\right]\leq 10\epsilon^{1/3}$. Symmetrically, we have $\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S_{t^{\prime}}\tau})^{2}\right]\leq 10\epsilon^{1/3}$. Therefore, $\mathbb{E}\left[(\mu_{\sigma T_{t^{\prime}}}-\mu_{S_{t^{\prime}}\tau})^{2}\right]\leq 40\epsilon^{1/3}$, which means that after round $t^{\prime}$, Alice and Bob $10\epsilon^{1/3}$-agree. ∎

Corollary 3.6 stands in contrast to the more general case, in which it is possible that Alice and Bob “nearly agree for the first $t-1$ time steps, then disagree violently at the $t$-th step” [Aar05, §2.2]. Thus, while the main purpose of Theorem 3.1 is a property about accuracy, an agreement property falls out naturally: under the rectangle substitutes condition, once Alice and Bob are close to agreement, they will remain in relatively close agreement into the future.

3.2 Graceful Decay Under Closeness to Rectangle Substitutes

In a sense, the rectangle substitutes condition is quite strong: it requires that the weak substitutes condition be satisfied on every sub-rectangle. One might hope for a result that generalizes Theorem 3.1 to information structures that almost satisfy the rectangle substitutes condition but do not quite. Let us formally define a notion of closeness to rectangle substitutes.

Definition 3.7.

An information structure $\mathcal{I}=(\Omega,\mathbb{P},\mathcal{S},\mathcal{T},Y)$ satisfies $\delta$-approximate rectangle substitutes if for every partition of $\mathcal{S}\times\mathcal{T}$ into rectangles, the rectangle substitutes condition holds in expectation over the partition, up to an additive constant of $\delta$, i.e., if we have

\[
\mathbb{E}_{\sigma,\tau}\left[(\mu_{\sigma\tau}-\mu_{S_{\sigma,\tau}\tau})^{2}\right]\leq\mathbb{E}_{\sigma,\tau}\left[(\mu_{\sigma T_{\sigma,\tau}}-\mu_{S_{\sigma,\tau}T_{\sigma,\tau}})^{2}\right]+\delta, \tag{4}
\]

where $S_{\sigma,\tau}\times T_{\sigma,\tau}$ is the rectangle containing $(\sigma,\tau)$. (There are partitions into rectangles that cannot arise from a communication protocol. Our results would apply equally if this condition were instead defined for every partition that could arise from a communication protocol, but we state this condition more generally so that it could be applicable in a broader context than the analysis of communication protocols.)

Remark 3.8.

The $\delta$-approximate rectangle substitutes property is a relaxation of the rectangle substitutes property, in the sense that the two are equivalent if $\delta=0$. To see this, first observe that if $\mathcal{I}$ satisfies rectangle substitutes, then it satisfies Equation 4 with $\delta=0$ pointwise across all $S_{\sigma,\tau},T_{\sigma,\tau}$, and thus in expectation. In the other direction, suppose that $\mathcal{I}$ satisfies $0$-approximate rectangle substitutes. Let $S\subseteq\mathcal{S},T\subseteq\mathcal{T}$ and consider the partition of $\mathcal{S}\times\mathcal{T}$ into rectangles that contains $S\times T$ and, separately, every other signal pair $(\sigma,\tau)$ in its own rectangle. For this partition, Equation 4 reduces precisely to Equation 1 (the rectangle substitutes condition for $S$ and $T$).

Theorem 3.1 generalizes to approximate rectangle substitutes as follows.

Theorem 3.9.

Let $\mathcal{I}=(\Omega,\mathbb{P},\mathcal{S},\mathcal{T},Y)$ be an information structure that satisfies $\delta$-approximate rectangle substitutes. For any communication protocol that causes Alice and Bob to $\epsilon$-agree on $\mathcal{I}$, Alice and Bob are $(10\epsilon^{1/3}+\delta)$-accurate after the protocol terminates.

Proof.

We first observe that Lemma 3.2 can be modified as follows.

Lemma 3.10.

Let $\mathcal{I}=(\Omega,\mathbb{P},\mathcal{S},\mathcal{T},Y)$ be an information structure that satisfies $\delta$-approximate rectangle substitutes. Let $\epsilon=\mathbb{E}\left[(\mu_{\sigma}-\mu_{\tau})^{2}\right]$. Then

\[
\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{\tau})^{2}\right]\leq 6\epsilon^{1/3}+\delta.
\]

The proof of Lemma 3.10 is exactly the same as that of Lemma 3.2, except that Equation 2 (Equation 6 in the full proof) includes an additive $\delta$ term on the left-hand side:

\[
\mathbb{E}\left[(\mu_{\sigma}-\mu_{S^{(k(\sigma))}})^{2}\right]+\delta\geq\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S^{(k(\sigma))}\tau})^{2}\right].
\]

This modified inequality follows immediately from the $\delta$-approximate rectangle substitutes condition, noting that one partition of $\mathcal{S}\times\mathcal{T}$ into rectangles is $\{S^{(1)}\times\mathcal{T},\dots,S^{(N)}\times\mathcal{T}\}$. The extra $\delta$ term produces the $\delta$ term in the lemma statement.

To prove the theorem, let $S$ be the set of possible signals of Alice at the end of the protocol which are consistent with the protocol transcript, and define $T$ likewise for Bob. Let $\delta_{ST}$ be the minimum $\delta$ such that $S\times T$ satisfies $\delta$-approximate rectangle substitutes. Note that $\mathbb{E}_{S,T}\left[\delta_{ST}\right]\leq\delta$: otherwise, by taking the union over the worst-case partitions for each $S,T$ we would exhibit a partition of $\mathcal{S}\times\mathcal{T}$ into rectangles that would violate the $\delta$-approximate rectangle substitutes property. Therefore we have

\begin{align*}
\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S\tau})^{2}\right] &=\mathbb{E}_{S,T}\left[\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S\tau})^{2}\mid S,T\right]\right]\\
&\leq\mathbb{E}_{S,T}\left[6\left(\mathbb{E}\left[(\mu_{\sigma T}-\mu_{S\tau})^{2}\mid S,T\right]\right)^{1/3}+\delta_{ST}\right]\\
&\leq 6\,\mathbb{E}_{S,T}\left[\mathbb{E}\left[(\mu_{\sigma T}-\mu_{S\tau})^{2}\mid S,T\right]\right]^{1/3}+\delta\\
&=6\,\mathbb{E}\left[(\mu_{\sigma T}-\mu_{S\tau})^{2}\right]^{1/3}+\delta=6(4\epsilon)^{1/3}+\delta\leq 10\epsilon^{1/3}+\delta.
\end{align*}

As in the proof of Theorem 3.1, the second step follows by applying Lemma 3.10 to the information structure $\mathcal{I}$ restricted to $S\times T$. ∎

4 Results for Other Divergence Measures

Squared distance is a compelling error measure because it elicits the mean. That is, if you wish to estimate a random variable $Y$ and will be penalized according to the squared distance between $Y$ and your estimate, the strategy that minimizes your expected penalty is to report the expected value of $Y$ (conditional on the information you have). This is in contrast to e.g. absolute distance as an error measure, which would instead elicit the median of your distribution. The class of error measures that elicit the mean is precisely the class of Bregman divergences (defined below).

In this section, our main result is a generalization of Theorem 3.1 to (almost) arbitrary Bregman divergences (see e.g. Theorem 4.14). Additionally, we provide a generalization of Aaronson’s discretized protocol to arbitrary Bregman divergences (Theorem 4.11).

4.1 Preliminaries on Bregman Divergences

Definition 4.1.

Given a differentiable, strictly convex function $G:[0,1]\to\mathbb{R}$ (when we say “differentiable,” we mean differentiable on the interior of the interval on which $G$ is defined), and $x,y\in[0,1]$, the Bregman divergence from $y$ to $x$ is

\[
D_{G}(y\parallel x):=G(y)-G(x)-(y-x)G^{\prime}(x).
\]
Figure 1: The Bregman divergence $D_{G}(y\parallel x)$ is the vertical distance at $y$ between $G$ and the tangent line to $G$ at $x$.
Proposition 4.2 ([Ban+05]).

Given a random variable $Y$, the quantity $\mathbb{E}\left[D_{G}(Y\parallel x)\right]$ is minimized by $x=\mathbb{E}\left[Y\right]$.

An intuitive formulation of Bregman divergence is that $D_{G}(y\parallel x)$ can be found by drawing the line tangent to $G$ at $x$ and computing how far below the point $(y,G(y))$ this line passes. We illustrate this in Figure 1. Note that the Bregman divergence is not in general symmetric in its arguments; indeed, $G(x)=x^{2}$ is the only $G$ for which it is.

The Bregman divergence with respect to $G(x)=x^{2}$ is precisely the squared distance. Another common Bregman divergence is the KL divergence, which corresponds to $G(x)=x\log x+(1-x)\log(1-x)$, the negative of Shannon entropy.
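
Both examples can be computed directly from Definition 4.1; in the following sketch the function names are our own and the derivatives are hard-coded for clarity.

```python
import math

# D_G(y || x) for G(x) = x^2 (squared distance) and for G = negative Shannon entropy
# (binary KL divergence between Bernoulli distributions with the given parameters).
def bregman(G, Gprime, y, x):
    return G(y) - G(x) - (y - x) * Gprime(x)

def G_sq(x):
    return x ** 2

def G_sq_prime(x):
    return 2 * x

def G_ent(x):
    return x * math.log(x) + (1 - x) * math.log(1 - x)

def G_ent_prime(x):
    return math.log(x) - math.log(1 - x)

print(bregman(G_sq, G_sq_prime, 0.9, 0.4))    # 0.25 = (0.9 - 0.4)^2
print(bregman(G_ent, G_ent_prime, 0.9, 0.4))  # KL(Bernoulli(0.9) || Bernoulli(0.4))
```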

We generalize relevant notions such as agreement and accuracy to arbitrary Bregman divergences as follows. In the definitions below, $G:[0,1]\to\mathbb{R}$ is a differentiable, strictly convex function.

Definition 4.3.

Let $a$ be Alice’s expectation. Alice is $\epsilon$-accurate if $\mathbb{E}\left[D_{G}(\mu_{\sigma\tau}\parallel a)\right]\leq\epsilon$, and likewise for Bob.

We discuss our choice of the order of these two arguments (i.e. why we do not instead consider the expectation of $D_{G}(a\parallel\mu_{\sigma\tau})$) in Appendix D. We now define $\epsilon$-agreement, and to do so we first define the Jensen-Bregman divergence.

Definition 4.4.

For $a,b\in[0,1]$, the Jensen-Bregman divergence between $a$ and $b$ with respect to $G$ is

\[
\text{JB}_{G}(a,b):=\frac{1}{2}\left(D_{G}\left(a\parallel\frac{a+b}{2}\right)+D_{G}\left(b\parallel\frac{a+b}{2}\right)\right)=\frac{G(a)+G(b)}{2}-G\left(\frac{a+b}{2}\right).
\]

The validity of the second equality can be easily derived from the definition of Bregman divergence. Note that the Jensen-Bregman divergence, unlike the Bregman divergence, is symmetric in its arguments. The Jensen-Bregman divergence is a lower bound on the average Bregman divergence from Alice and Bob to any other point (see Proposition C.1 (i)).
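
The second form makes the Jensen-Bregman divergence a one-line computation; as a sanity check (with a function name of our own choosing), $G(x)=x^{2}$ recovers the agreement measure of Definition 2.1.

```python
# Jensen-Bregman divergence via the second form in Definition 4.4.
def jensen_bregman(G, a, b):
    return (G(a) + G(b)) / 2 - G((a + b) / 2)

print(jensen_bregman(lambda x: x ** 2, 0.2, 0.8))   # 0.09 = (0.8 - 0.2)^2 / 4
```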

Definition 4.5.

Let $a$ and $b$ be Alice’s and Bob’s expectations, respectively. Alice and Bob $\epsilon$-agree with respect to $G$ if $\mathbb{E}\left[\text{JB}_{G}(a,b)\right]\leq\epsilon$.

In Appendix D we discuss alternative definitions of agreement and accuracy. The upshot of this discussion is that our definition of agreement is the weakest reasonable one, and our definition of accuracy is the strongest reasonable one. This means that the main result of this section—that under a wide class of Bregman divergences, agreement implies accuracy—is quite powerful: it starts with a weak premise and proves a strong conclusion.

Definition 4.6.

Given an information structure $\mathcal{I}$, a communication protocol causes Alice and Bob to $\epsilon$-agree on $\mathcal{I}$ with respect to $G$ if Alice and Bob $\epsilon$-agree with respect to $G$ at the end of the protocol. A communication protocol is an $\epsilon$-agreement protocol with respect to $G$ if the protocol causes Alice and Bob to $\epsilon$-agree with respect to $G$ on every information structure.

We also generalize the notion of rectangle substitutes to this domain, following [CW16], which explored notions of substitutes for arbitrary Bregman divergences.

Definition 4.7.

Let $G:[0,1]\to\mathbb{R}$ be a differentiable, strictly convex function. An information structure $\mathcal{I}=(\Omega,\mathbb{P},\mathcal{S},\mathcal{T},Y)$ satisfies rectangle substitutes with respect to $G$ if for every $S\subseteq\mathcal{S},T\subseteq\mathcal{T}$, we have

\[
\mathbb{E}\left[D_{G}(Y\parallel\mu_{S\tau})\mid S,T\right]-\mathbb{E}\left[D_{G}(Y\parallel\mu_{\sigma\tau})\mid S,T\right]\leq\mathbb{E}\left[D_{G}(Y\parallel\mu_{ST})\mid S,T\right]-\mathbb{E}\left[D_{G}(Y\parallel\mu_{\sigma T})\mid S,T\right].
\]

The Pythagorean theorem (Proposition 2.7) generalizes to arbitrary Bregman divergences:

Proposition 4.8.

Let $A$ be a random variable, $B=\mathbb{E}\left[A\mid\mathcal{F}\right]$ where $\mathcal{F}$ is a sigma-algebra, and $C$ be a random variable defined on $\mathcal{F}$. Then

\[
\mathbb{E}\left[D_{G}(A\parallel C)\right]=\mathbb{E}\left[D_{G}(A\parallel B)\right]+\mathbb{E}\left[D_{G}(B\parallel C)\right].
\]

Although the proof of this observation is fairly straightforward, to our knowledge Proposition 4.8 is original to this work. We provide a proof in Appendix C. Just as we did with squared error, this general Pythagorean theorem allows us to rewrite the rectangle substitutes condition for Bregman divergences.

Remark 4.9.

An information structure $\mathcal{I}$ satisfies rectangle substitutes with respect to $G$ if and only if for all $S\subseteq\mathcal{S},T\subseteq\mathcal{T}$ we have

\[
\mathbb{E}\left[D_{G}(\mu_{\sigma\tau}\parallel\mu_{S\tau})\mid S,T\right]\leq\mathbb{E}\left[D_{G}(\mu_{\sigma T}\parallel\mu_{ST})\mid S,T\right]. \tag{5}
\]

Given the interpretation of Bregman divergences as measures of error, we can interpret the left side as Bob’s expected error in predicting the truth while the right side is Charlie’s expected error when predicting Alice’s expectation. Both sides measure a prediction error due to not having Alice’s signal, but from different starting points.

4.2 Generalizing the Discretized Protocol

Later in this work, we will show that under some weak conditions, protocols that cause Alice and Bob to agree with respect to $G$ also cause Alice and Bob to be accurate with respect to $G$. However, this raises an interesting question: are there protocols that cause Alice and Bob to agree with respect to $G$? In particular, we are interested in natural expectation-sharing protocols. Aaronson’s discretized protocol is specific to $G(x)=x^{2}$, and it is not immediately obvious how to generalize it. We present the following generalization.

Definition 4.10.

Let $G$ be a differentiable, strictly convex function, and let $M:=\max_{x}G(x)-\min_{x}G(x)$. Choose $\epsilon>0$. In the discretized protocol with respect to $G$ with parameter $\epsilon$, on her turn (at time $t$), Alice sends “medium” if $D_{G}(\mu_{\sigma T_{t-1}}\parallel\mu_{S_{t-1}T_{t-1}})<\frac{\epsilon}{2}$, and otherwise either “low” or “high”, depending on whether $\mu_{\sigma T_{t-1}}$ is smaller or larger (respectively) than $\mu_{S_{t-1}T_{t-1}}$. Bob acts analogously on his turn. At the start of the protocol, Alice and Bob use the information structure to independently compute the time $t_{\text{end}}\leq\frac{24M(4M+\epsilon)}{\epsilon^{2}}$ that minimizes $\mathbb{E}\left[D_{G}(\mu_{\sigma T_{t_{\text{end}}}}\parallel\mu_{S_{t_{\text{end}}}\tau})\right]$. The protocol ends at this time.
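
As with Definition 2.3, the message rule can be phrased as a small function; in the sketch below `alice_expectation` and `charlie_expectation` are hypothetical stand-ins for $\mu_{\sigma T_{t-1}}$ and $\mu_{S_{t-1}T_{t-1}}$, and the only change from the squared-distance case is that the “medium” test uses the Bregman divergence with threshold $\epsilon/2$.

```python
# A sketch of Alice's message rule in the discretized protocol with respect to G
# (Definition 4.10).
def bregman(G, Gprime, y, x):
    return G(y) - G(x) - (y - x) * Gprime(x)

def alice_message_G(G, Gprime, alice_expectation, charlie_expectation, eps):
    if bregman(G, Gprime, alice_expectation, charlie_expectation) < eps / 2:
        return "medium"
    return "low" if alice_expectation < charlie_expectation else "high"
```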

Theorem 4.11.

The discretized protocol with respect to $G$ with parameter $\epsilon$ is an $\epsilon$-agreement protocol with respect to $G$ that involves $O\left(\frac{M(M+\epsilon)}{\epsilon^{2}}\right)$ bits of communication.

Our proof draws inspiration from Aaronson’s proof of the discretized protocol, but has significant differences. The key idea is to keep track of the monovariant $\mathbb{E}\left[D_{G}(Y\parallel\mu_{S_{t}T_{t}})\right]$. This is Charlie’s expected error (as measured by the Bregman divergence from the correct answer $Y$) after time step $t$—recall that Charlie is our name for a third-party observer of the protocol. Note that this quantity is at most $M$ and at least $0$. Hence, if we show that the quantity decreases by at least some value $\beta$ every time Alice and Bob do not $\epsilon$-agree, then we will have shown that Alice and Bob must $\epsilon$-agree within $\frac{M}{\beta}$ time steps. We defer the proof to Appendix C.

4.3 Approximate Triangle Inequality

Our results will hold for a class of Jensen-Bregman divergences that satisfy an approximate version of the triangle inequality. Specifically, we will require $\text{JB}_{G}$ to satisfy the following $c$-approximate triangle inequality for some $c>0$.

Definition 4.12.

Given a differentiable, strictly convex function $G:[0,1]\to\mathbb{R}$ and a positive number $c$, we say that $\text{JB}_{G}(\cdot,\cdot)$ satisfies the $c$-approximate triangle inequality if for all $a,b,x\in[0,1]$ we have

\[
\text{JB}_{G}(a,x)+\text{JB}_{G}(x,b)\geq c\,\text{JB}_{G}(a,b).
\]

It is possible to construct functions $G$ such that there is no positive $c$ for which $\text{JB}_{G}$ satisfies the $c$-approximate triangle inequality. However, $\text{JB}_{G}$ satisfies the $c$-approximate triangle inequality for some positive $c$ for essentially all natural choices of $G$.

Proposition 4.13.

Let $G:[0,1]\to\mathbb{R}$ be a differentiable, strictly convex function.

  (i) If $\sqrt{\text{JB}_{G}(\cdot,\cdot)}$ satisfies the triangle inequality, then $\text{JB}_{G}$ satisfies the $\frac{1}{2}$-approximate triangle inequality.

  (ii) If $G(x)=x^{2}$ (i.e. $D_{G}$ is squared distance) or if $G(x)=x\log x+(1-x)\log(1-x)$ (i.e. $D_{G}$ is KL divergence), then $\sqrt{\text{JB}_{G}}$ satisfies the triangle inequality (and so $\text{JB}_{G}$ satisfies the $\frac{1}{2}$-approximate triangle inequality).

Proof.

Regarding Fact (i), suppose that $\sqrt{\text{JB}_{G}}$ satisfies the triangle inequality. Then for all $a,b,x$ we have $\sqrt{\text{JB}_{G}(a,x)}+\sqrt{\text{JB}_{G}(x,b)}\geq\sqrt{\text{JB}_{G}(a,b)}$. Squaring both sides and observing that $\text{JB}_{G}(a,x)+\text{JB}_{G}(x,b)\geq 2\sqrt{\text{JB}_{G}(a,x)\text{JB}_{G}(x,b)}$ completes the proof.

Fact (ii) is trivial for $G(x)=x^{2}$, since $\sqrt{\text{JB}_{G}}$ is the absolute distance metric (times a constant factor). As for $G(x)=x\log x+(1-x)\log(1-x)$, we refer the reader to [ES03]. ∎

The question of when $\sqrt{\text{JB}_{G}}$ satisfies the triangle inequality has been explored in previous work; we refer the interested reader to [ABB13] and [CCR08].
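
As a numerical (not rigorous) spot-check of Proposition 4.13(ii) for the negative-entropy $G$, the following sketch verifies the triangle inequality for $\sqrt{\text{JB}_{G}}$ on a grid of interior points; the grid and tolerance are arbitrary choices of ours.

```python
import math
from itertools import product

# Spot-check: sqrt(JB_G) should satisfy the triangle inequality for the negative-entropy G.
# We stay away from the endpoints 0 and 1, where G is defined by taking limits.
def G(x):
    return x * math.log(x) + (1 - x) * math.log(1 - x)

def jb(a, b):
    return (G(a) + G(b)) / 2 - G((a + b) / 2)

grid = [i / 50 for i in range(1, 50)]
violations = [(a, b, x) for a, b, x in product(grid, repeat=3)
              if math.sqrt(max(jb(a, x), 0.0)) + math.sqrt(max(jb(x, b), 0.0))
              < math.sqrt(max(jb(a, b), 0.0)) - 1e-12]
print(len(violations))   # expected output: 0
```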

4.4 Generalized Agreement Implies Generalized Accuracy

In all of the results in this subsection, we consider the following setting: $G$ is a differentiable convex function; $c$ is a positive real number such that $\text{JB}_{G}$ satisfies the $c$-approximate triangle inequality; and $\mathcal{I}=(\Omega,\mathbb{P},\mathcal{S},\mathcal{T},Y)$ is an information structure that satisfies rectangle substitutes with respect to $G$.

We prove generalizations of Theorem 3.1, showing that under the rectangle substitutes condition, if a protocol ends with Alice and Bob in approximate agreement, then Alice and Bob are approximately accurate. The first result we state assumes that $G$ is symmetric, but is otherwise quite general.

Theorem 4.14.

Assume that $G$ is symmetric about the line $x=\frac{1}{2}$. For any communication protocol that causes Alice and Bob to $\epsilon$-agree on $\mathcal{I}$, and for any $\beta\geq\frac{2}{c}\epsilon$, Alice and Bob are

\[
\left(\frac{8}{c^{2}}\beta+16\left(G(0)-G\left(\left(\frac{\epsilon}{\beta}\right)^{1/(1-\log_{2}c)}\right)\right)\right)\text{-accurate}
\]

after the protocol terminates.

This result is not our most general, as it assumes that $G$ is symmetric, but this assumption likely holds for most use cases. To apply the result optimally, one must first optimize $\beta$ as a function of $G$. For example, setting $\beta=\epsilon^{r/(r+1-\log_{2}c)}$ (with $r$ defined below) gives us the following corollary (Corollary 4.15 as stated, without the symmetry assumption, is actually a corollary of Theorem 4.18):

Corollary 4.15.

Assume that $G(0)-G(x),G(1)-G(1-x)\leq O(x^{r})$. For any communication protocol that causes Alice and Bob to $\epsilon$-agree on $\mathcal{I}$, Alice and Bob are $O\left(\epsilon^{r/(r+1-\log_{2}c)}\right)$-accurate after the protocol terminates, where the constant hidden by $O(\cdot)$ depends on $G$.

Remark 4.16.

Concretely, if $G^{\prime}$ is bounded then we can choose $r=1$, in which case our bound simplifies to $O\left(\epsilon^{1/(2-\log_{2}c)}\right)$. If instead we assume that $c=\frac{1}{2}$ (as is the case if $\sqrt{\text{JB}_{G}(\cdot,\cdot)}$ is a metric), then the bound is $O\left(\epsilon^{r/(r+2)}\right)$. If both of these are true, as is the case for $G(x)=x^{2}$, then the bound is $O(\epsilon^{1/3})$, which recovers our result in Theorem 3.1.

For $G$ equal to the negative of Shannon entropy (i.e. the $G$ for which $D_{G}$ is KL divergence), setting $\beta=\epsilon^{1/3}(\log 1/\epsilon)^{2/3}$ in Theorem 4.14 gives us the following corollary.

Corollary 4.17.

If $G(x)=x\log x+(1-x)\log(1-x)$, then for any communication protocol that causes Alice and Bob to $\epsilon$-agree on $\mathcal{I}$, Alice and Bob are $O(\epsilon^{1/3}(\log 1/\epsilon)^{2/3})$-accurate after the protocol terminates.

Theorem 4.14 follows from our most general result about agreement implying accuracy:

Theorem 4.18.

Let $\tilde{G}(x):=\max_{a,b:\left\lvert a-b\right\rvert\leq x}(G(a)-G(b))$ be the maximum possible difference in $G$-values of two points that differ by at most $x$, and let $\tilde{G}^{*}(x)$ be the concave envelope of $\tilde{G}$, i.e.

\[
\tilde{G}^{*}(x):=\max_{0\leq a,b,w\leq 1:\,wa+(1-w)b=x}w\tilde{G}(a)+(1-w)\tilde{G}(b).
\]

For any communication protocol that causes Alice and Bob to $\epsilon$-agree on $\mathcal{I}$, and for any $\beta>0$, Alice and Bob are

\[
\left(\frac{8}{c^{2}}\beta+16\tilde{G}^{*}\left(\left(\frac{\epsilon}{\beta}\right)^{1/(1-\log_{2}c)}\right)\right)\text{-accurate}
\]

after the protocol terminates.

Proof.

To prove Theorem 4.18, it suffices to prove the following lemma.

Lemma 4.19.

Let $G$ be a differentiable convex function on $[0,1]$ and $c\in(0,1)$ be such that $\text{JB}_{G}$ satisfies the $c$-approximate triangle inequality. Let $\mathcal{I}=(\Omega,\mathbb{P},\mathcal{S},\mathcal{T},Y)$ be an information structure that satisfies rectangle substitutes with respect to $G$. Let $\epsilon=\mathbb{E}\left[\text{JB}_{G}(\mu_{\sigma},\mu_{\tau})\right]$. Then for any $\beta>0$, we have

\[
\mathbb{E}\left[D_{G}(\mu_{\sigma\tau}\parallel\mu_{\tau})\right]\leq\frac{8}{c^{2}}\beta+16\tilde{G}^{*}\left(\left(\frac{\epsilon}{\beta}\right)^{1/(1-\log_{2}c)}\right).
\]

Let us first prove Theorem 4.18 assuming Lemma 4.19 is true.

Consider any protocol that causes Alice and Bob to $\epsilon$-agree on $\mathcal{I}$. Let $S$ be the set of possible signals of Alice at the end of the protocol which are consistent with the protocol transcript, and define $T$ likewise for Bob.

Let $\epsilon_{ST}=\mathbb{E}\left[\text{JB}_{G}(\mu_{\sigma T},\mu_{S\tau})\mid S,T\right]$. Note that

\[
\mathbb{E}_{S,T}\left[\epsilon_{ST}\right]=\mathbb{E}_{S,T}\left[\mathbb{E}\left[\text{JB}_{G}(\mu_{\sigma T},\mu_{S\tau})\mid S,T\right]\right]=\mathbb{E}\left[\text{JB}_{G}(\mu_{\sigma T},\mu_{S\tau})\right]\leq\epsilon.
\]

Therefore, for any $\beta>0$ we have

\begin{align*}
\mathbb{E}\left[D_{G}(\mu_{\sigma\tau}\parallel\mu_{S\tau})\right] &\leq\mathbb{E}_{S,T}\left[\frac{8}{c^{2}}\beta+16\tilde{G}^{*}\left(\left(\frac{\epsilon_{ST}}{\beta}\right)^{1/(1-\log_{2}c)}\right)\right]\\
&\leq\frac{8}{c^{2}}\beta+16\tilde{G}^{*}\left(\mathbb{E}_{S,T}\left[\left(\frac{\epsilon_{ST}}{\beta}\right)^{1/(1-\log_{2}c)}\right]\right)\\
&\leq\frac{8}{c^{2}}\beta+16\tilde{G}^{*}\left(\left(\frac{\mathbb{E}_{S,T}\left[\epsilon_{ST}\right]}{\beta}\right)^{1/(1-\log_{2}c)}\right)\\
&\leq\frac{8}{c^{2}}\beta+16\tilde{G}^{*}\left(\left(\frac{\epsilon}{\beta}\right)^{1/(1-\log_{2}c)}\right).
\end{align*}

In the first step, we apply Lemma 4.19 to the information structure $\mathcal{I}$ restricted to $S\times T$—that is, to $\mathcal{I}^{\prime}=(\Omega^{\prime},\mathbb{P}^{\prime},S,T,Y)$, where $\Omega^{\prime}=\{\omega\in\Omega:\sigma\in S,\tau\in T\}$ and $\mathbb{P}^{\prime}[\omega]=\mathbb{P}\left[\omega\mid\sigma\in S,\tau\in T\right]$. The next two steps follow by the concavity of $\tilde{G}^{*}$ and of $x^{1/(1-\log_{2}c)}$, respectively. ∎

The basic outline of the proof of Lemma 4.19 is similar to that of Lemma 3.2. Once again, we partition $[0,1]$ into $N$ intervals. Analogously to Equation 3, and with $S^{(k(\sigma))}$ defined analogously, we find that

\[
\mathbb{E}\left[D_{G}(\mu_{\sigma\tau}\parallel\mu_{\tau})\right]\leq\mathbb{E}\left[D_{G}(\mu_{\sigma}\parallel\mu_{S^{(k(\sigma))}})\right]+\mathbb{E}\left[D_{G}(\mu_{S^{(k(\sigma))}\tau}\parallel\mu_{\tau})\right].
\]

As before, we wish to upper bound each summand. However, the fact that the Bregman divergence is now arbitrary introduces complications. First, it is no longer the case that we can directly relate the length of an interval to the Bregman divergence between its endpoints. Second, we consider functions $G$ that become infinitely steep near $0$ and $1$ (such as the negative of Shannon entropy), which makes matters more challenging. This means that we need to be more careful when partitioning $[0,1]$ into $N$ intervals: see Algorithm C.3 for our new approach. Additionally, bounding the second summand involves reasoning carefully about the behavior of the function $G$, which is responsible for the introduction of $\tilde{G}^{*}$ into the lemma statement. We defer the full proof of Lemma 4.19 to Appendix C.

5 Connections to Markets

In this work, we established a natural condition on information structures, rectangle substitutes, under which any agreement protocol results in accurate beliefs. As we saw, a particularly natural class of agreement protocols are expectation-sharing protocols, where Alice and Bob take turns stating their current expected value, or discretizations thereof.

Expectation-sharing protocols have close connections to financial markets. In markets, the actions of traders reveal partial information about their believed value for an asset, i.e., their expectation. Specifically, a trader’s decision about whether to buy or sell, and how much, can be viewed as revealing a discretization of this expectation. In many theoretical models of markets (see e.g. [Ost12]) traders eventually reach agreement. The intuition behind this phenomenon is that a trader who disagrees with the price leaves money on the table by refusing to trade. Our work thus provides a lens into a well-studied question:111111This is related to the efficient market hypothesis, the problem of when market prices reflect all available information, which traces back at least to [Fam70] and [Hay45]. Modern models of financial markets are often based on [Kyl85]; we refer the reader to e.g. [Ost12] and references therein for further information. when are market prices accurate?

An important caveat, however, is that traders behave strategically, and may not disclose their true expected value. For example, a trader may choose to withhold information until a later point when doing so would be more profitable. Therefore, to interpret the actions of traders as revealing discretized versions of their expected value, one first has to understand the Bayes-Nash equilibria of the market. [CW16] studies conditions under which traders are incentivized to reveal all of their information on their first trading opportunity. They call a market equilibrium all-rush if every trader is incentivized to reveal their information immediately. Their main result, roughly speaking, is that there is an all-rush equilibrium if and only if the information structure satisfies strong substitutes—a different strengthening of the weak substitutes condition. This result is specific to settings in which traders have the option to reveal all of their information on their turn—a setting that would be considered trivial from the standpoint of communication theory.

An exciting question for further study is therefore: under what information structure conditions and market settings is it a Bayes-Nash equilibrium to follow an agreement protocol leading to accurate beliefs? In other words, what conditions give not only that agreement implies accuracy, but also that the market incentivizes participants to follow the protocol? Together with [CW16], our work suggests that certain substitutes-like conditions could suffice.

References

  • [Aar05] Scott Aaronson “The complexity of agreement” In Proceedings of the 37th Annual ACM Symposium on Theory of Computing, Baltimore, MD, USA, May 22-24, 2005 ACM, 2005, pp. 634–643 DOI: 10.1145/1060590.1060686
  • [ABB13] Sreangsu Acharyya, Arindam Banerjee and Daniel Boley “Bregman Divergences and Triangle Inequality” In Proceedings of the 13th SIAM International Conference on Data Mining, May 2-4, 2013. Austin, Texas, USA SIAM, 2013, pp. 476–484 DOI: 10.1137/1.9781611972832.53
  • [Aum76] Robert J. Aumann “Agreeing to Disagree” In The Annals of Statistics 4.6 Institute of Mathematical Statistics, 1976, pp. 1236–1239 URL: http://www.jstor.org/stable/2958591
  • [Ban+05] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon and Joydeep Ghosh “Clustering with Bregman divergences” In Journal of Machine Learning Research 6, 2005, pp. 1705–1749
  • [CCR08] P. Chen, Y. Chen and M. Rao “Metrics defined by Bregman Divergences” In Communications in Mathematical Sciences 6.4 International Press of Boston, 2008, pp. 915–926 DOI: cms/1229619676
  • [CW16] Yiling Chen and Bo Waggoner “Informational Substitutes” In IEEE 57th Annual Symposium on Foundations of Computer Science, FOCS 2016, 9-11 October 2016, Hyatt Regency, New Brunswick, New Jersey, USA IEEE Computer Society, 2016, pp. 239–247 DOI: 10.1109/FOCS.2016.33
  • [ES03] Dominik Maria Endres and Johannes E. Schindelin “A new metric for probability distributions” In IEEE Trans. Inf. Theory 49.7, 2003, pp. 1858–1860 DOI: 10.1109/TIT.2003.813506
  • [Fam70] Eugene F. Fama “Efficient capital markets: A review of theory and empirical work” In The Journal of Finance 25.2 Blackwell Publishing for the American Finance Association, 1970, pp. 383–417
  • [Hay45] Friedrich August Hayek “The use of knowledge in society” In The American economic review 35.4 JSTOR, 1945, pp. 519–530
  • [Kyl85] Albert S. Kyle “Continuous auctions and insider trading” In Econometrica: Journal of the Econometric Society 53.6, 1985, pp. 1315–1335
  • [NR21] Eric Neyman and Tim Roughgarden “Are You Smarter Than a Random Expert? The Robust Aggregation of Substitutable Signals”, 2021
  • [NR21a] Eric Neyman and Tim Roughgarden “From Proper Scoring Rules to Max-Min Optimal Forecast Aggregation” In EC ’21: The 22nd ACM Conference on Economics and Computation, Budapest, Hungary, July 18-23, 2021 ACM, 2021, pp. 734 DOI: 10.1145/3465456.3467599
  • [Ost12] Michael Ostrovsky “Information aggregation in dynamic markets with strategic traders” In Econometrica 80.6 Wiley Online Library, 2012, pp. 2595–2647
  • [RY20] Anup Rao and Amir Yehudayoff “Communication Complexity: and Applications” Cambridge University Press, 2020
  • [Šid57] Z. Šidák “On Relations Between Strict-Sense and Wide-Sense Conditional Expectations” In Theory of Probability and Its Applications 2, 1957, pp. 267–272

Appendix A Details Omitted From Section 2

Above we stated that the weak substitutes condition is not sufficient for agreement to imply accuracy. To see this, consider the following information structure:

  • Nature flips a coin. Alice’s and Bob’s signals are each a pair consisting of the outcome of the coin flip and an additional bit:

  • If the coin lands heads, Alice and Bob are given highly correlated bits as part of their signals: the bits are (0,0)(0,0) and (1,1)(1,1) with probability 45% each and (1,0)(1,0) and (0,1)(0,1) with probability 5% each.

  • If the coin lands tails, Alice and Bob are given highly anticorrelated bits as part of their signals: the bits are (0,0)(0,0) and (1,1)(1,1) with probability 5% each and (1,0)(1,0) and (0,1)(0,1) with probability 45% each.

The value YY is the XOR of Alice’s and Bob’s bits. It can be verified that this information structure satisfies weak substitutes; the intuition is that a majority of the value comes from knowing the outcome of the coin flip, which can be inferred from Alice’s (or Bob’s) signal alone.

Alice and Bob 0-agree at the very start. However, they are not 0-accurate, since their expectations are either 10% or 90%, and the right answer is either 0 or 1. Therefore, weak substitutes alone is insufficient for agreement to imply accuracy.
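
To make the example concrete, the following small Python sketch (the probabilities are exactly those listed above; the helper names are ours) enumerates the eight outcomes, computes Alice's and Bob's conditional expectations of Y, and confirms that the two always coincide while the pooled-information answer does not.

```python
from itertools import product

# Outcomes are triples (coin, a, b): a is Alice's extra bit, b is Bob's, and Y = a XOR b.
prob = {}
for coin, a, b in product(["H", "T"], [0, 1], [0, 1]):
    same = 0.45 if coin == "H" else 0.05   # probability of each agreeing pair, given the coin
    diff = 0.05 if coin == "H" else 0.45   # probability of each disagreeing pair, given the coin
    prob[(coin, a, b)] = 0.5 * (same if a == b else diff)

def cond_exp(keep):
    """E[Y | event], where the event consists of the outcomes o with keep(o) True."""
    total = sum(p for o, p in prob.items() if keep(o))
    return sum(p * (o[1] ^ o[2]) for o, p in prob.items() if keep(o)) / total

for (coin, a, b) in prob:
    alice = cond_exp(lambda o: o[0] == coin and o[1] == a)  # Alice's signal is (coin, a)
    bob   = cond_exp(lambda o: o[0] == coin and o[2] == b)  # Bob's signal is (coin, b)
    print(coin, (a, b), f"Alice: {alice:.2f}", f"Bob: {bob:.2f}", "truth:", a ^ b)

# Output: Alice's and Bob's expectations are identical in every outcome (0.10 on heads,
# 0.90 on tails), so they 0-agree from the start, yet Y itself is always 0 or 1.
```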

Appendix B Details Omitted From Section 3

Proof of Claim 3.3.

We claim that in fact we can choose the xix_{i}’s so that each xix_{i} is in [iN12N,iN+12N]\left[\frac{i}{N}-\frac{1}{2N},\frac{i}{N}+\frac{1}{2N}\right]. This ensures that each interval has length at most 2N\frac{2}{N}.

For x[0,1]x\in[0,1], let ρ(x)\rho(x) be the probability that xx is between μσ\mu_{\sigma} and μτ\mu_{\tau}, inclusive. Note that [k(σ)k(τ)]i=1N1ρ(xi)\mathbb{P}\left[k(\sigma)\neq k(\tau)\right]\leq\sum_{i=1}^{N-1}\rho(x_{i}).

Observe that if xx is selected uniformly from [0,1][0,1], the expected value of ρ(x)\rho(x) is equal to |μσμτ|\left\lvert\mu_{\sigma}-\mu_{\tau}\right\rvert, because both quantities are equal to the probability that xx is between μσ\mu_{\sigma} and μτ\mu_{\tau}. Therefore, if (σ,τ)(\sigma,\tau) is additionally chosen according to \mathbb{P}, we have

𝔼x[0,1][ρ(x)]=𝔼[|μσμτ|]𝔼[(μσμτ)2]=ϵ.\mathbb{E}_{x\leftarrow[0,1]}\left[\rho(x)\right]=\mathbb{E}\left[\left\lvert\mu_{\sigma}-\mu_{\tau}\right\rvert\right]\leq\sqrt{\mathbb{E}\left[(\mu_{\sigma}-\mu_{\tau})^{2}\right]}=\sqrt{\epsilon}.

This means that

i=1N1𝔼x[iN12N,iN+12N][ρ(x)]=(N1)𝔼x[12N,112N][ρ(x)]ϵN.\sum_{i=1}^{N-1}\mathbb{E}_{x\leftarrow\left[\frac{i}{N}-\frac{1}{2N},\frac{i}{N}+\frac{1}{2N}\right]}\left[\rho(x)\right]=(N-1)\mathbb{E}_{x\leftarrow\left[\frac{1}{2N},1-\frac{1}{2N}\right]}\left[\rho(x)\right]\leq\sqrt{\epsilon}N.

Thus, if each xix_{i} is selected uniformly at random from [iN12N,iN+12N]\left[\frac{i}{N}-\frac{1}{2N},\frac{i}{N}+\frac{1}{2N}\right], the expected value of [k(σ)k(τ)]\mathbb{P}\left[k(\sigma)\neq k(\tau)\right] would be at most ϵN\sqrt{\epsilon}N. In particular this means that there exist choices of the xix_{i}’s such that [k(σ)k(τ)]ϵN\mathbb{P}\left[k(\sigma)\neq k(\tau)\right]\leq\sqrt{\epsilon}N. ∎
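
In the finite-signal case, this averaging argument can be made constructive. The sketch below assumes the joint distribution of Alice's and Bob's estimates is given as a list of (probability, μσ, μτ) triples (a hypothetical input format); it samples each xi from its window and keeps the best draw, with random search standing in for the existence argument above, which guarantees a choice achieving P[k(σ)≠k(τ)] ≤ √ε·N.

```python
import bisect
import numpy as np

def mismatch_prob(cuts, pairs):
    """P[k(sigma) != k(tau)]: the two estimates land in different cells of the partition."""
    return sum(p for p, a, b in pairs
               if bisect.bisect_right(cuts, a) != bisect.bisect_right(cuts, b))

def choose_cuts(pairs, N, trials=500, seed=0):
    """Sample each x_i uniformly from [i/N - 1/(2N), i/N + 1/(2N)] and keep the best draw.
    The averaging argument in Claim 3.3 shows the minimum is at most sqrt(eps) * N."""
    rng = np.random.default_rng(seed)
    best_cuts, best_val = None, np.inf
    for _ in range(trials):
        cuts = [(i + rng.uniform(-0.5, 0.5)) / N for i in range(1, N)]
        val = mismatch_prob(cuts, pairs)
        if val < best_val:
            best_cuts, best_val = cuts, val
    return best_cuts, best_val
```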

Proof of Lemma 3.2.

Fix a large positive integer NN (we will later find it optimal to set N=ϵ1/6N=\epsilon^{-1/6}). Consider a partition of [0,1][0,1] into NN intervals [0,x1),[x1,x2),,[xN1,1][0,x_{1}),[x_{1},x_{2}),\dots,[x_{N-1},1] satisfying the conditions of Claim 3.3. Let S(k):={σ𝒮:xk1μσ<xk}S^{(k)}:=\{\sigma\in\mathcal{S}:x_{k-1}\leq\mu_{\sigma}<x_{k}\}. Additionally, let k(σ)k(\sigma) and k(τ)k(\tau) be as defined in Claim 3.3.

Our goal is to upper bound the expectation of (μστμτ)2(\mu_{\sigma\tau}-\mu_{\tau})^{2}. In pursuit of this goal, we observe that by the Pythagorean theorem, we have

𝔼[(μστμτ)2]=𝔼[(μστμS(k(σ))τ)2]+𝔼[(μS(k(σ))τμτ)2].\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{\tau})^{2}\right]=\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S^{(k(\sigma))}\tau})^{2}\right]+\mathbb{E}\left[(\mu_{S^{(k(\sigma))}\tau}-\mu_{\tau})^{2}\right].

We now use the rectangle substitutes assumption: for any kk, by applying Equation 1 to S=S(k)S=S^{(k)} and T=𝒯T=\mathcal{T}, we know that

𝔼[(μσμS(k))2σS(k)]𝔼[(μστμS(k)τ)2σS(k)].\mathbb{E}\left[(\mu_{\sigma}-\mu_{S^{(k)}})^{2}\mid\sigma\in S^{(k)}\right]\geq\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S^{(k)}\tau})^{2}\mid\sigma\in S^{(k)}\right].

Taking the expectation over kk (i.e. choosing each kk with probability equal to [σS(k)]\mathbb{P}\left[\sigma\in S^{(k)}\right]), we have that

𝔼[(μσμS(k(σ)))2]𝔼[(μστμS(k(σ))τ)2].\mathbb{E}\left[(\mu_{\sigma}-\mu_{S^{(k(\sigma))}})^{2}\right]\geq\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S^{(k(\sigma))}\tau})^{2}\right]. (6)

Therefore, we have

𝔼[(μστμτ)2]𝔼[(μσμS(k(σ)))2]+𝔼[(μS(k(σ))τμτ)2].\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{\tau})^{2}\right]\leq\mathbb{E}\left[(\mu_{\sigma}-\mu_{S^{(k(\sigma))}})^{2}\right]+\mathbb{E}\left[(\mu_{S^{(k(\sigma))}\tau}-\mu_{\tau})^{2}\right]. (7)

We will use Claim 3.3 to argue that each of these two summands is small. The argument regarding the first summand is straightforward: for any σ\sigma, we have that xk(σ)1μσ,μS(k(σ))<xk(σ)xk(σ)1+2Nx_{k(\sigma)-1}\leq\mu_{\sigma},\mu_{S^{(k(\sigma))}}<x_{k(\sigma)}\leq x_{k(\sigma)-1}+\frac{2}{N}, which means that 𝔼[(μσμS(k(σ)))2]4N2\mathbb{E}\left[(\mu_{\sigma}-\mu_{S^{(k(\sigma))}})^{2}\right]\leq\frac{4}{N^{2}}.

We now upper bound the second summand.121212The proof below takes sums over τ^𝒯\hat{\tau}\in\mathcal{T} and thus implicitly assumes that 𝒯\mathcal{T} is finite, but the proof extends to infinite 𝒯\mathcal{T}, with sums over τ\tau replaced by integrals with respect to the probability measure over 𝒯\mathcal{T}. For any τ^𝒯\hat{\tau}\in\mathcal{T}, let p(τ^)=[τ=τ^]p(\hat{\tau})=\mathbb{P}\left[\tau=\hat{\tau}\right] and q(τ^)=[τ=τ^,k(σ)k(τ)]q(\hat{\tau})=\mathbb{P}\left[\tau=\hat{\tau},k(\sigma)\neq k(\tau)\right]. Then τ^𝒯p(τ^)=1\sum_{\hat{\tau}\in\mathcal{T}}p(\hat{\tau})=1 and τ^𝒯q(τ^)ϵN\sum_{\hat{\tau}\in\mathcal{T}}q(\hat{\tau})\leq\sqrt{\epsilon}N. Observe that

𝔼[(μS(k(σ))τμτ)2]\displaystyle\mathbb{E}\left[(\mu_{S^{(k(\sigma))}\tau}-\mu_{\tau})^{2}\right] =τ^p(τ^)𝔼[(μS(k(σ))τ^μτ^)2τ=τ^]\displaystyle=\sum_{\hat{\tau}}p(\hat{\tau})\mathbb{E}\left[(\mu_{S^{(k(\sigma))}\hat{\tau}}-\mu_{\hat{\tau}})^{2}\mid\tau=\hat{\tau}\right]
=τ^(p(τ^)q(τ^))𝔼[(μS(k(σ))τ^μτ^)2τ=τ^,k(σ)=k(τ^)]\displaystyle=\sum_{\hat{\tau}}(p(\hat{\tau})-q(\hat{\tau}))\mathbb{E}\left[(\mu_{S^{(k(\sigma))}\hat{\tau}}-\mu_{\hat{\tau}})^{2}\mid\tau=\hat{\tau},k(\sigma)=k(\hat{\tau})\right]
+q(τ^)𝔼[(μS(k(σ))τ^μτ^)2τ=τ^,k(σ)k(τ^)].\displaystyle\qquad+q(\hat{\tau})\mathbb{E}\left[(\mu_{S^{(k(\sigma))}\hat{\tau}}-\mu_{\hat{\tau}})^{2}\mid\tau=\hat{\tau},k(\sigma)\neq k(\hat{\tau})\right]. (8)

To handle the first expectation, we note that if k(σ)=k(τ)k(\sigma)=k(\tau), then |μS(k(σ))τ^μτ^|q(τ^)p(τ^)\left\lvert\mu_{S^{(k(\sigma))}\hat{\tau}}-\mu_{\hat{\tau}}\right\rvert\leq\frac{q(\hat{\tau})}{p(\hat{\tau})}. To see this, observe

p(τ^)μτ^=(p(τ^)q(τ^))μS(k(σ))τ^+q(τ^)μ𝒮S(k(σ))τ^.p(\hat{\tau})\mu_{\hat{\tau}}=(p(\hat{\tau})-q(\hat{\tau}))\mu_{S^{(k(\sigma))}\hat{\tau}}+q(\hat{\tau})\mu_{\mathcal{S}\setminus S^{(k(\sigma))}\hat{\tau}}~{}.

Rearranging and taking absolute values, we conclude

p(τ^)|μS(k(σ))τ^μτ^|=q(τ^)|μS(k(σ))τ^μ𝒮S(k(σ))τ^|q(τ^).p(\hat{\tau})\left\lvert\mu_{S^{(k(\sigma))}\hat{\tau}}-\mu_{\hat{\tau}}\right\rvert=q(\hat{\tau})\left\lvert\mu_{S^{(k(\sigma))}\hat{\tau}}-\mu_{\mathcal{S}\setminus S^{(k(\sigma))}\hat{\tau}}\right\rvert\leq q(\hat{\tau}).

Therefore, recalling q(τ^)p(τ^)q(\hat{\tau})\leq p(\hat{\tau}), we have

(p(τ^)q(τ^))𝔼[(μS(k(σ))τ^μτ^)2τ=τ^,k(σ)=k(τ^)](p(τ^)q(τ^))(q(τ^)p(τ^))2q(τ^)2p(τ^)q(τ^).\displaystyle(p(\hat{\tau})-q(\hat{\tau}))\mathbb{E}\left[(\mu_{S^{(k(\sigma))}\hat{\tau}}-\mu_{\hat{\tau}})^{2}\mid\tau=\hat{\tau},k(\sigma)=k(\hat{\tau})\right]\leq(p(\hat{\tau})-q(\hat{\tau}))\left(\frac{q(\hat{\tau})}{p(\hat{\tau})}\right)^{2}\leq\frac{q(\hat{\tau})^{2}}{p(\hat{\tau})}\leq q(\hat{\tau}).

On the other hand, we can bound the second expectation in Equation 8 by 11. Therefore we have

𝔼[(μS(k(σ))τμτ)2]τ^(q(τ^)+q(τ^))=2τ^q(τ^)2ϵN.\mathbb{E}\left[(\mu_{S^{(k(\sigma))}\tau}-\mu_{\tau})^{2}\right]\leq\sum_{\hat{\tau}}(q(\hat{\tau})+q(\hat{\tau}))=2\sum_{\hat{\tau}}q(\hat{\tau})\leq 2\sqrt{\epsilon}N.

The last inequality above is by Claim 3.3. To conclude, we now know that

𝔼[(μστμτ)2]4N2+2ϵN.\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{\tau})^{2}\right]\leq\frac{4}{N^{2}}+2\sqrt{\epsilon}N.

Setting N=ϵ1/6N=\epsilon^{-1/6} makes the right-hand side equal to 6ϵ1/36\epsilon^{1/3}, completing the proof. ∎

Proposition B.1.

Consider the following protocol, parametrized by ϵ>0\epsilon>0. Alice and Bob send their initial expectations to each other, rounding to the nearest multiple of ϵ\epsilon. This protocol entails communicating O(log1/ϵ)O(\log 1/\epsilon) bits. At the end of the protocol, Alice and Bob 2ϵ22\epsilon^{2}-agree and are ϵ2\epsilon^{2}-accurate (with respect to G(x)=x2G(x)=x^{2}).

Proof.

Let SS be the set of possible signals of Alice at the end of the protocol which are consistent with the protocol transcript, and define TT likewise for Bob. Recall that we use 𝒮\mathcal{S} and 𝒯\mathcal{T} to denote the sets of all of Alice’s and Bob’s possible signals, respectively. We have

𝔼[(μστμSτ)2]𝔼[(μσμS)2]ϵ2,\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S\tau})^{2}\right]\leq\mathbb{E}\left[(\mu_{\sigma}-\mu_{S})^{2}\right]\leq\epsilon^{2},

since μσ\mu_{\sigma} and μS\mu_{S} are guaranteed to be within ϵ\epsilon of each other by construction. Thus, Bob is ϵ2\epsilon^{2}-accurate, and likewise for Alice. By the 12\frac{1}{2}-approximate triangle inequality for G(x)=x2G(x)=x^{2}, it follows that Alice and Bob 2ϵ22\epsilon^{2}-agree. ∎
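
For reference, the protocol of Proposition B.1 amounts to a single exchange of rounded expectations; a minimal sketch follows (the helper names are ours, and the accuracy analysis above is of course not simulated here). After the exchange, Bob's posterior μSτ conditions on the announced bucket of Alice's expectation, which is exactly what the proof compares against.

```python
import math

def round_to_grid(mu, eps):
    """Round an expectation in [0, 1] to the nearest multiple of eps."""
    return min(1.0, round(mu / eps) * eps)

def one_round_protocol(mu_alice, mu_bob, eps):
    """Each party announces its rounded expectation; there are roughly 1/eps + 1 possible
    messages, so each message costs O(log(1/eps)) bits."""
    bits_per_message = math.ceil(math.log2(1 / eps + 1))
    return round_to_grid(mu_alice, eps), round_to_grid(mu_bob, eps), bits_per_message
```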

Appendix C Details Omitted From Section 4

Proof of Proposition 4.8.

Let g=Gg=G^{\prime}. We have

𝔼[DG(AB)]+𝔼[DG(BC)]𝔼[DG(AC)]\displaystyle\mathbb{E}\left[D_{G}(A\parallel B)\right]+\mathbb{E}\left[D_{G}(B\parallel C)\right]-\mathbb{E}\left[D_{G}(A\parallel C)\right]
=𝔼[G(A)G(B)(AB)g(B)+G(B)G(C)(BC)g(C)G(A)+G(C)+(AC)g(C)]\displaystyle=\mathbb{E}\left[G(A)-G(B)-(A-B)g(B)+G(B)-G(C)-(B-C)g(C)-G(A)+G(C)+(A-C)g(C)\right]
=𝔼[(AB)(g(C)g(B))]=𝔼[𝔼[(AB)(g(C)g(B))]]\displaystyle=\mathbb{E}\left[(A-B)(g(C)-g(B))\right]=\mathbb{E}\left[\mathbb{E}\left[(A-B)(g(C)-g(B))\mid\mathcal{F}\right]\right]
=𝔼[(g(C)g(B))𝔼[AB]]=𝔼[(g(C)g(B))(𝔼[A]B)]=0.\displaystyle=\mathbb{E}\left[(g(C)-g(B))\mathbb{E}\left[A-B\mid\mathcal{F}\right]\right]=\mathbb{E}\left[(g(C)-g(B))(\mathbb{E}\left[A\mid\mathcal{F}\right]-B)\right]=0.

The third-to-last step follows from the fact that g(C)g(B)g(C)-g(B) is \mathcal{F}-measurable (we are using the “pulling out known factors” property of conditional expectation). The last step follows from the fact that 𝔼[A]=B\mathbb{E}\left[A\mid\mathcal{F}\right]=B. ∎
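
Since Proposition 4.8 is an exact identity, it is easy to sanity-check numerically. The sketch below uses a made-up 3×3 table of values μ_st in (0,1) with a uniform prior and takes G to be the negative Shannon entropy; here A is the most informed estimate, B = E[A | s] (so the σ-algebra F is generated by s), and C = E[A] is constant, hence F-measurable.

```python
import numpy as np

rng = np.random.default_rng(0)

def G(x):                      # negative Shannon entropy (a convex function on (0,1))
    return x * np.log(x) + (1 - x) * np.log(1 - x)

def bregman(p, q):             # D_G(p || q); for this G it equals the binary KL divergence
    return G(p) - G(q) - (p - q) * (np.log(q) - np.log(1 - q))

# Toy structure: s and t each take 3 values, uniformly and independently, and
# mu_st = E[Y | s, t] is an arbitrary table of probabilities (made up for the test).
P = np.full((3, 3), 1 / 9)
mu_st = rng.uniform(0.05, 0.95, size=(3, 3))

A = mu_st                                      # knows (s, t)
B = (P * A).sum(axis=1) / P.sum(axis=1)        # E[A | s]; F = sigma(s)
C = (P * A).sum()                              # E[A]; a constant, hence F-measurable

lhs = (P * bregman(A, B[:, None])).sum() + (P * bregman(B[:, None], C)).sum()
rhs = (P * bregman(A, C)).sum()
print(lhs, rhs)    # equal up to floating-point error, as Proposition 4.8 predicts
```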

Proposition C.1.

Let GG be a differentiable convex function on the interval [0,1][0,1]. For all 0ab10\leq a\leq b\leq 1, we have

  1. (i)

    12(DG(ax)+DG(bx))JBG(a,b)\frac{1}{2}(D_{G}(a\parallel x)+D_{G}(b\parallel x))\geq\text{JB}_{G}(a,b) for every x[0,1]x\in[0,1].

  2. (ii)

    JBG\text{JB}_{G} satisfies the reverse triangle inequality: for every x[a,b]x\in[a,b], we have JBG(a,x)+JBG(x,b)JBG(a,b)\text{JB}_{G}(a,x)+\text{JB}_{G}(x,b)\leq\text{JB}_{G}(a,b).

  3. (iii)

    For all aabba\leq a^{\prime}\leq b^{\prime}\leq b, we have JBG(a,b)JBG(a,b)\text{JB}_{G}(a^{\prime},b^{\prime})\leq\text{JB}_{G}(a,b).

  4. (iv)

    For a random variable XX supported on [a,b][a,b], we have

    𝔼[DG(X𝔼[X])]=𝔼[G(X)]G(𝔼[X])2JBG(a,b).\mathbb{E}\left[D_{G}(X\parallel\mathbb{E}\left[X\right])\right]=\mathbb{E}\left[G(X)\right]-G(\mathbb{E}\left[X\right])\leq 2\text{JB}_{G}(a,b).
Proof.

Fact (i) follows from Proposition 4.2. Regarding Fact (ii), without loss of generality assume that xa+b2x\leq\frac{a+b}{2} and that G(x)=G(a+b2)G(x)=G\left(\frac{a+b}{2}\right) (uniformly adding a constant to the derivative of GG does not change any Jensen-Bregman divergence, hence the second assumption). Then G(a+x2)G(x)G\left(\frac{a+x}{2}\right)\geq G(x), so JBG(a,x)G(a)G(x)2\text{JB}_{G}(a,x)\leq\frac{G(a)-G(x)}{2}. Since b+x2a+b2\frac{b+x}{2}\geq\frac{a+b}{2}, we also have that G(b+x2)G(x)G\left(\frac{b+x}{2}\right)\geq G(x), so JBG(b,x)G(b)G(x)2\text{JB}_{G}(b,x)\leq\frac{G(b)-G(x)}{2}. Thus, we have

JBG(a,x)+JBG(b,x)G(a)+G(b)2G(x)=G(a)+G(b)2G(a+b2)=JBG(a,b).\text{JB}_{G}(a,x)+\text{JB}_{G}(b,x)\leq\frac{G(a)+G(b)}{2}-G(x)=\frac{G(a)+G(b)}{2}-G\left(\frac{a+b}{2}\right)=\text{JB}_{G}(a,b).

Fact (iii) follows from Fact (ii): we have

JBG(a,b)=JBG(a,a)+JBG(a,b)+JBG(b,b)JBG(a,b).\text{JB}_{G}(a,b)=\text{JB}_{G}(a,a^{\prime})+\text{JB}_{G}(a^{\prime},b^{\prime})+\text{JB}_{G}(b^{\prime},b)\geq\text{JB}_{G}(a^{\prime},b^{\prime}).

Regarding the equality in Fact (iv), we have

𝔼[DG(X𝔼[X])]\displaystyle\mathbb{E}\left[D_{G}(X\parallel\mathbb{E}\left[X\right])\right] =𝔼[G(X)G(𝔼[X])(X𝔼[X])G(𝔼[X])]\displaystyle=\mathbb{E}\left[G(X)-G(\mathbb{E}\left[X\right])-(X-\mathbb{E}\left[X\right])G^{\prime}(\mathbb{E}\left[X\right])\right]
=𝔼[G(X)G(𝔼[X])]=𝔼[G(X)]G(𝔼[X]),\displaystyle=\mathbb{E}\left[G(X)-G(\mathbb{E}\left[X\right])\right]=\mathbb{E}\left[G(X)\right]-G(\mathbb{E}\left[X\right]),

where the first step follows from the fact that 𝔼[(X𝔼[X])G(𝔼[X])]=G(𝔼[X])𝔼[X𝔼[X]]\mathbb{E}\left[(X-\mathbb{E}\left[X\right])G^{\prime}(\mathbb{E}\left[X\right])\right]=G^{\prime}(\mathbb{E}\left[X\right])\mathbb{E}\left[X-\mathbb{E}\left[X\right]\right], and 𝔼[X𝔼[X]]=0\mathbb{E}\left[X-\mathbb{E}\left[X\right]\right]=0.

Regarding the inequality in Fact (iv), without loss of generality assume that 𝔼[X]a+b2\mathbb{E}\left[X\right]\leq\frac{a+b}{2}. By convexity we have that

G(a+b2)ba2b𝔼[X]G(𝔼[X])+a+b2𝔼[X]b𝔼[X]G(b),G\left(\frac{a+b}{2}\right)\leq\frac{\frac{b-a}{2}}{b-\mathbb{E}\left[X\right]}G(\mathbb{E}\left[X\right])+\frac{\frac{a+b}{2}-\mathbb{E}\left[X\right]}{b-\mathbb{E}\left[X\right]}G(b),

so

JBG(a,b)\displaystyle\text{JB}_{G}(a,b) =G(a)+G(b)2G(a+b2)\displaystyle=G(a)+G(b)-2G\left(\frac{a+b}{2}\right)
G(a)+G(b)bab𝔼[X]G(𝔼[X])a+b2𝔼[X]b𝔼[X]G(b)\displaystyle\geq G(a)+G(b)-\frac{b-a}{b-\mathbb{E}\left[X\right]}G(\mathbb{E}\left[X\right])-\frac{a+b-2\mathbb{E}\left[X\right]}{b-\mathbb{E}\left[X\right]}G(b)
=G(a)+𝔼[X]ab𝔼[X]G(b)bab𝔼[X]G(𝔼[X])\displaystyle=G(a)+\frac{\mathbb{E}\left[X\right]-a}{b-\mathbb{E}\left[X\right]}G(b)-\frac{b-a}{b-\mathbb{E}\left[X\right]}G(\mathbb{E}\left[X\right])
=bab𝔼[X](b𝔼[X]baG(a)+𝔼[X]abaG(b)G(𝔼[X]))\displaystyle=\frac{b-a}{b-\mathbb{E}\left[X\right]}\left(\frac{b-\mathbb{E}\left[X\right]}{b-a}G(a)+\frac{\mathbb{E}\left[X\right]-a}{b-a}G(b)-G(\mathbb{E}\left[X\right])\right)
b𝔼[X]baG(a)+𝔼[X]abaG(b)G(𝔼[X])𝔼[G(X)]G(𝔼[X]).\displaystyle\geq\frac{b-\mathbb{E}\left[X\right]}{b-a}G(a)+\frac{\mathbb{E}\left[X\right]-a}{b-a}G(b)-G(\mathbb{E}\left[X\right])\geq\mathbb{E}\left[G(X)\right]-G(\mathbb{E}\left[X\right]).

In the last step we use the fact that for a convex function ff and a random variable XX defined on an interval [a,b][a,b] with mean μ\mu, the maximum possible value of 𝔼[f(X)]\mathbb{E}\left[f(X)\right] is attained if XX is either aa or bb with the appropriate probabilities. ∎
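
For a concrete instance of these facts, take G to be the negative Shannon entropy, so that D_G is the binary KL divergence and JB_G is the (binary) Jensen-Shannon divergence; the sketch below spot-checks facts (i)-(iv) at random points (the tolerances are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(1)

def G(x):              # negative Shannon entropy
    return x * np.log(x) + (1 - x) * np.log(1 - x)

def D(p, q):           # Bregman divergence D_G(p || q): the binary KL divergence
    return G(p) - G(q) - (p - q) * (np.log(q) - np.log(1 - q))

def JB(a, b):          # Jensen-Bregman divergence; here the binary Jensen-Shannon divergence
    return (G(a) + G(b)) / 2 - G((a + b) / 2)

for _ in range(1000):
    a, b = np.sort(rng.uniform(0.01, 0.99, size=2))
    x = rng.uniform(a, b)                                     # a point of [a, b]
    ap, bp = np.sort(rng.uniform(a, b, size=2))               # a sub-interval of [a, b]
    X = rng.uniform(a, b, size=50)                            # a random variable on [a, b]
    assert (D(a, x) + D(b, x)) / 2 >= JB(a, b) - 1e-12        # (i)
    assert JB(a, x) + JB(x, b) <= JB(a, b) + 1e-12            # (ii)  reverse triangle inequality
    assert JB(ap, bp) <= JB(a, b) + 1e-12                     # (iii) monotonicity
    assert G(X).mean() - G(X.mean()) <= 2 * JB(a, b) + 1e-12  # (iv)
```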

Proof of Theorem 4.11.

Suppose that Alice and Bob do not ϵ\epsilon-agree at time step tt, and without loss of generality assume that the next turn (number t+1t+1) is Alice’s. We begin by observing that, by Proposition C.1 (i), we have

𝔼[DG(μσTtμStTt)+DG(μStτμStTt)]2𝔼[JBG(μσTt,μStτ)]>2ϵ.\mathbb{E}\left[D_{G}(\mu_{\sigma T_{t}}\parallel\mu_{S_{t}T_{t}})+D_{G}(\mu_{S_{t}\tau}\parallel\mu_{S_{t}T_{t}})\right]\geq 2\mathbb{E}\left[\text{JB}_{G}(\mu_{\sigma T_{t}},\mu_{S_{t}\tau})\right]>2\epsilon.

Therefore, either 𝔼[DG(μσTtμStTt)]2ϵ3\mathbb{E}\left[D_{G}(\mu_{\sigma T_{t}}\parallel\mu_{S_{t}T_{t}})\right]\geq\frac{2\epsilon}{3} or 𝔼[DG(μStτμStTt)]4ϵ3\mathbb{E}\left[D_{G}(\mu_{S_{t}\tau}\parallel\mu_{S_{t}T_{t}})\right]\geq\frac{4\epsilon}{3}.

Case 1:

𝔼[DG(μσTtμStTt)]2ϵ3\mathbb{E}\left[D_{G}(\mu_{\sigma T_{t}}\parallel\mu_{S_{t}T_{t}})\right]\geq\frac{2\epsilon}{3}. Let us use “hi,” “lo,” and “md” to denote the events that Alice says “high,” Alice says “low,” and Alice says “medium,” respectively. We have

2ϵ3\displaystyle\frac{2\epsilon}{3} 𝔼[DG(μσTtμStTt)]=𝔼[𝔼[DG(μσTtμStTt)St,Tt]]\displaystyle\leq\mathbb{E}\left[D_{G}(\mu_{\sigma T_{t}}\parallel\mu_{S_{t}T_{t}})\right]=\mathbb{E}\left[\mathbb{E}\left[D_{G}(\mu_{\sigma T_{t}}\parallel\mu_{S_{t}T_{t}})\mid S_{t},T_{t}\right]\right]
=𝔼[𝔼[DG(μσTtμStTt)𝟙hi or loSt,Tt]]+𝔼[𝔼[DG(μσTtμStTt)𝟙mdSt,Tt]]\displaystyle=\mathbb{E}\left[\mathbb{E}\left[D_{G}(\mu_{\sigma T_{t}}\parallel\mu_{S_{t}T_{t}})\cdot\mathbbm{1}_{\text{hi or lo}}\mid S_{t},T_{t}\right]\right]+\mathbb{E}\left[\mathbb{E}\left[D_{G}(\mu_{\sigma T_{t}}\parallel\mu_{S_{t}T_{t}})\cdot\mathbbm{1}_{\text{md}}\mid S_{t},T_{t}\right]\right]
𝔼[𝔼[DG(μσTtμStTt)𝟙hi or loSt,Tt]]+ϵ2,\displaystyle\leq\mathbb{E}\left[\mathbb{E}\left[D_{G}(\mu_{\sigma T_{t}}\parallel\mu_{S_{t}T_{t}})\cdot\mathbbm{1}_{\text{hi or lo}}\mid S_{t},T_{t}\right]\right]+\frac{\epsilon}{2},

where “St,Tt\mid S_{t},T_{t}” is short for “σSt,τTt\mid\sigma\in S_{t},\tau\in T_{t},” a notation we use throughout the proof. We thus have

𝔼[𝔼[DG(μσTtμStTt)𝟙hiSt,Tt]]+𝔼[𝔼[DG(μσTtμStTt)𝟙loSt,Tt]]ϵ6.\mathbb{E}\left[\mathbb{E}\left[D_{G}(\mu_{\sigma T_{t}}\parallel\mu_{S_{t}T_{t}})\cdot\mathbbm{1}_{\text{hi}}\mid S_{t},T_{t}\right]\right]+\mathbb{E}\left[\mathbb{E}\left[D_{G}(\mu_{\sigma T_{t}}\parallel\mu_{S_{t}T_{t}})\cdot\mathbbm{1}_{\text{lo}}\mid S_{t},T_{t}\right]\right]\geq\frac{\epsilon}{6}. (9)

We now make use of the following lemma.

Lemma C.2.

Suppose that turn t+1t+1 is Alice’s. Let “hi” denote the event that Alice says “high.” Let α:=𝔼[DG(μσTtμStTt)𝟙hiSt,Tt]\alpha:=\mathbb{E}\left[D_{G}(\mu_{\sigma T_{t}}\parallel\mu_{S_{t}T_{t}})\cdot\mathbbm{1}_{\text{hi}}\mid S_{t},T_{t}\right]. Then

𝔼[DG(μSt+1Tt+1μStTt)𝟙hiSt,Tt]αϵ8M+2ϵ.\mathbb{E}\left[D_{G}(\mu_{S_{t+1}T_{t+1}}\parallel\mu_{S_{t}T_{t}})\cdot\mathbbm{1}_{\text{hi}}\mid S_{t},T_{t}\right]\geq\frac{\alpha\epsilon}{8M+2\epsilon}.

The analogous statement is true if Alice says “low,” and likewise if it is instead Bob’s turn.

We assume Lemma C.2 and return to prove it afterward. This lemma translates Equation 9 into a statement about how much Charlie learns. Specifically, we have that

𝔼[DG(μSt+1Tt+1μStTt)]=𝔼[𝔼[DG(μSt+1Tt+1μStTt)St,Tt]]\displaystyle\mathbb{E}\left[D_{G}(\mu_{S_{t+1}T_{t+1}}\parallel\mu_{S_{t}T_{t}})\right]=\mathbb{E}\left[\mathbb{E}\left[D_{G}(\mu_{S_{t+1}T_{t+1}}\parallel\mu_{S_{t}T_{t}})\mid S_{t},T_{t}\right]\right]
𝔼[𝔼[DG(μSt+1Tt+1μStTt)𝟙hiSt,Tt]]+𝔼[𝔼[DG(μSt+1Tt+1μStTt)𝟙loSt,Tt]]\displaystyle\geq\mathbb{E}\left[\mathbb{E}\left[D_{G}(\mu_{S_{t+1}T_{t+1}}\parallel\mu_{S_{t}T_{t}})\cdot\mathbbm{1}_{\text{hi}}\mid S_{t},T_{t}\right]\right]+\mathbb{E}\left[\mathbb{E}\left[D_{G}(\mu_{S_{t+1}T_{t+1}}\parallel\mu_{S_{t}T_{t}})\cdot\mathbbm{1}_{\text{lo}}\mid S_{t},T_{t}\right]\right]
ϵ8M+2ϵ(𝔼[𝔼[DG(μσTtμStTt)𝟙hiSt,Tt]]+𝔼[𝔼[DG(μσTtμStTt)𝟙loSt,Tt]])\displaystyle\geq\frac{\epsilon}{8M+2\epsilon}(\mathbb{E}\left[\mathbb{E}\left[D_{G}(\mu_{\sigma T_{t}}\parallel\mu_{S_{t}T_{t}})\cdot\mathbbm{1}_{\text{hi}}\mid S_{t},T_{t}\right]\right]+\mathbb{E}\left[\mathbb{E}\left[D_{G}(\mu_{\sigma T_{t}}\parallel\mu_{S_{t}T_{t}})\cdot\mathbbm{1}_{\text{lo}}\mid S_{t},T_{t}\right]\right])
ϵ26(8M+2ϵ).\displaystyle\geq\frac{\epsilon^{2}}{6(8M+2\epsilon)}.

Case 2:

𝔼[DG(μStτμStTt)]4ϵ3\mathbb{E}\left[D_{G}(\mu_{S_{t}\tau}\parallel\mu_{S_{t}T_{t}})\right]\geq\frac{4\epsilon}{3}. Using the Pythagorean theorem to write the same Bregman divergence in two ways, we have that

𝔼[DG(μSt+1τμSt+1Tt+1)]+𝔼[DG(μSt+1Tt+1μStTt)]=𝔼[DG(μSt+1τμStTt)]\displaystyle\mathbb{E}\left[D_{G}(\mu_{S_{t+1}\tau}\parallel\mu_{S_{t+1}T_{t+1}})\right]+\mathbb{E}\left[D_{G}(\mu_{S_{t+1}T_{t+1}}\parallel\mu_{S_{t}T_{t}})\right]=\mathbb{E}\left[D_{G}(\mu_{S_{t+1}\tau}\parallel\mu_{S_{t}T_{t}})\right]
=𝔼[DG(μSt+1τμStτ)]+𝔼[DG(μStτμStTt)]𝔼[DG(μStτμStTt)]4ϵ3.\displaystyle=\mathbb{E}\left[D_{G}(\mu_{S_{t+1}\tau}\parallel\mu_{S_{t}\tau})\right]+\mathbb{E}\left[D_{G}(\mu_{S_{t}\tau}\parallel\mu_{S_{t}T_{t}})\right]\geq\mathbb{E}\left[D_{G}(\mu_{S_{t}\tau}\parallel\mu_{S_{t}T_{t}})\right]\geq\frac{4\epsilon}{3}.

This means that one of the two summands on the left-hand side is at least 2ϵ3\frac{2\epsilon}{3}.

Case 2a: 𝔼[DG(μSt+1τμSt+1Tt+1)]2ϵ3\mathbb{E}\left[D_{G}(\mu_{S_{t+1}\tau}\parallel\mu_{S_{t+1}T_{t+1}})\right]\geq\frac{2\epsilon}{3}. In that case we have that

𝔼[DG(μSt+2Tt+2μSt+1Tt+1)]ϵ26(8M+2ϵ)\mathbb{E}\left[D_{G}(\mu_{S_{t+2}T_{t+2}}\parallel\mu_{S_{t+1}T_{t+1}})\right]\geq\frac{\epsilon^{2}}{6(8M+2\epsilon)}

by the same logic as in Case 1.

Case 2b: 𝔼[DG(μSt+1Tt+1μStTt)]2ϵ3ϵ212ϵϵ26(8M+2ϵ)\mathbb{E}\left[D_{G}(\mu_{S_{t+1}T_{t+1}}\parallel\mu_{S_{t}T_{t}})\right]\geq\frac{2\epsilon}{3}\geq\frac{\epsilon^{2}}{12\epsilon}\geq\frac{\epsilon^{2}}{6(8M+2\epsilon)}.

In each of our cases, we have that

𝔼[DG(YμStTt)DG(YμSt+2Tt+2)]=𝔼[DG(μSt+2Tt+2μStTt)]\displaystyle\mathbb{E}\left[D_{G}(Y\parallel\mu_{S_{t}T_{t}})-D_{G}(Y\parallel\mu_{S_{t+2}T_{t+2}})\right]=\mathbb{E}\left[D_{G}(\mu_{S_{t+2}T_{t+2}}\parallel\mu_{S_{t}T_{t}})\right]
=𝔼[DG(μSt+2Tt+2μSt+1Tt+1)]+𝔼[DG(μSt+1Tt+1μStTt)]ϵ26(8M+2ϵ).\displaystyle=\mathbb{E}\left[D_{G}(\mu_{S_{t+2}T_{t+2}}\parallel\mu_{S_{t+1}T_{t+1}})\right]+\mathbb{E}\left[D_{G}(\mu_{S_{t+1}T_{t+1}}\parallel\mu_{S_{t}T_{t}})\right]\geq\frac{\epsilon^{2}}{6(8M+2\epsilon)}.

Therefore, the total number of steps until agreement is first reached cannot be more than

2Mϵ26(8M+2ϵ)=24M(4M+ϵ)ϵ2.2\cdot\frac{M}{\frac{\epsilon^{2}}{6(8M+2\epsilon)}}=\frac{24M(4M+\epsilon)}{\epsilon^{2}}.

This completes the proof. ∎

We now prove Lemma C.2.

Proof of Lemma C.2.

We will restrict our probability space to outcomes where Charlie knows St,TtS_{t},T_{t} at time tt (and thus omit “St,Tt\mid S_{t},T_{t}” from here on). For convenience, we will let A:=μσTtA:=\mu_{\sigma T_{t}} be Alice’s expectation (a random variable) and c:=μStTtc:=\mu_{S_{t}T_{t}} be Charlie’s expectation (which is a particular number in [0,1][0,1]). We will let ϵ:=ϵ2\epsilon^{\prime}:=\frac{\epsilon}{2}, so that if Alice says “high” then Charlie knows that A>cA>c and that DG(Ac)ϵD_{G}(A\parallel c)\geq\epsilon^{\prime}.

Let D(x):=DG(xc)=G(x)G(c)G(c)(xc)D(x):=D_{G}(x\parallel c)=G(x)-G(c)-G^{\prime}(c)(x-c), and let a^h:=𝔼[Ahi]\hat{a}_{h}:=\mathbb{E}\left[A\mid\text{hi}\right]. Note that if Alice says “high” then μSt+1Tt+1=a^h\mu_{S_{t+1}T_{t+1}}=\hat{a}_{h}. In our new notation, we may write α=𝔼[D(A)hi][hi]\alpha=\mathbb{E}\left[D(A)\mid\text{hi}\right]\cdot\mathbb{P}\left[\text{hi}\right], and we wish to show that D(a^h)[hi]αϵ2(M+ϵ)D(\hat{a}_{h})\cdot\mathbb{P}\left[\text{hi}\right]\geq\frac{\alpha\epsilon^{\prime}}{2(M+\epsilon^{\prime})}. Put otherwise, our goal is to show that

D(a^h)𝔼[D(A)hi]ϵ2(M+ϵ).\frac{D(\hat{a}_{h})}{\mathbb{E}\left[D(A)\mid\text{hi}\right]}\geq\frac{\epsilon^{\prime}}{2(M+\epsilon^{\prime})}.

For convenience we will let BB denote the quantity on the left-hand side.

Let ahmina_{\text{hmin}} be the number larger than cc such that D(a)=ϵD(a)=\epsilon^{\prime}, so that AahminA\geq a_{\text{hmin}} whenever Alice says “high.”131313If D(a)<ϵD(a)<\epsilon^{\prime} for all a>ca>c then Alice never says “high” and the lemma statement is trivial. Observe that since DD is convex (Bregman divergences are convex in their first argument), for a fixed value of a^h\hat{a}_{h}, the value of 𝔼[D(A)hi]\mathbb{E}\left[D(A)\mid\text{hi}\right] is maximized when AA is either ahmina_{\text{hmin}} or 11 (with probabilities 1a^h1ahmin\frac{1-\hat{a}_{h}}{1-a_{\text{hmin}}} and a^hahmin1ahmin\frac{\hat{a}_{h}-a_{\text{hmin}}}{1-a_{\text{hmin}}}, respectively). Therefore we have

B=D(a^h)𝔼[D(A)hi]D(a^h)(1ahmin)(1a^h)ϵ+(a^hahmin)D(1).B=\frac{D(\hat{a}_{h})}{\mathbb{E}\left[D(A)\mid\text{hi}\right]}\geq\frac{D(\hat{a}_{h})(1-a_{\text{hmin}})}{(1-\hat{a}_{h})\epsilon^{\prime}+(\hat{a}_{h}-a_{\text{hmin}})D(1)}. (10)

Case 1:

(1a^h)ϵ(a^hahmin)D(1)(1-\hat{a}_{h})\epsilon^{\prime}\geq(\hat{a}_{h}-a_{\text{hmin}})D(1). In that case we have

BD(a^h)(1ahmin)2(1a^h)ϵϵ(1ahmin)2(1a^h)ϵ12ϵ2(M+ϵ).B\geq\frac{D(\hat{a}_{h})(1-a_{\text{hmin}})}{2(1-\hat{a}_{h})\epsilon^{\prime}}\geq\frac{\epsilon^{\prime}(1-a_{\text{hmin}})}{2(1-\hat{a}_{h})\epsilon^{\prime}}\geq\frac{1}{2}\geq\frac{\epsilon^{\prime}}{2(M+\epsilon^{\prime})}.

Case 2:

(1a^h)ϵ(a^hahmin)D(1)(1-\hat{a}_{h})\epsilon^{\prime}\leq(\hat{a}_{h}-a_{\text{hmin}})D(1). In that case we have

BD(a^h)(1ahmin)2(a^hahmin)D(1).B\geq\frac{D(\hat{a}_{h})(1-a_{\text{hmin}})}{2(\hat{a}_{h}-a_{\text{hmin}})D(1)}. (11)

Case 2a: D(1)1ca^hc(M+ϵ)D(1)\leq\frac{1-c}{\hat{a}_{h}-c}(M+\epsilon^{\prime}). Then we have

BD(a^h)(1ahmin)2(a^hahmin)1ca^hc(M+ϵ)ϵ2(M+ϵ)(1ahmin)(a^hc)(a^hahmin)(1c).B\geq\frac{D(\hat{a}_{h})(1-a_{\text{hmin}})}{2(\hat{a}_{h}-a_{\text{hmin}})\cdot\frac{1-c}{\hat{a}_{h}-c}(M+\epsilon^{\prime})}\geq\frac{\epsilon^{\prime}}{2(M+\epsilon^{\prime})}\cdot\frac{(1-a_{\text{hmin}})(\hat{a}_{h}-c)}{(\hat{a}_{h}-a_{\text{hmin}})(1-c)}.

(In the last step we again use that D(a^h)ϵD(\hat{a}_{h})\geq\epsilon^{\prime}.) Now, it is easy to verify that the second fraction is at least 11 (this comes down to the fact that ahminca_{\text{hmin}}\geq c), so we indeed have that Bϵ2(M+ϵ)B\geq\frac{\epsilon^{\prime}}{2(M+\epsilon^{\prime})}.

Case 2b: D(1)1ca^hc(M+ϵ)D(1)\geq\frac{1-c}{\hat{a}_{h}-c}(M+\epsilon^{\prime}). We claim that for all xcx\geq c, we have that

D(x)xc1cD(1)M.D(x)\geq\frac{x-c}{1-c}D(1)-M. (12)

To see this, suppose for contradiction that for some xx we have D(x)<xc1cD(1)MD(x)<\frac{x-c}{1-c}D(1)-M. Then

G(x)G(c)G(c)(xc)\displaystyle G(x)-G(c)-G^{\prime}(c)(x-c) <xc1c(G(1)G(c)G(c)(1c))M\displaystyle<\frac{x-c}{1-c}(G(1)-G(c)-G^{\prime}(c)(1-c))-M
(1c)G(x)(1c)G(c)\displaystyle(1-c)G(x)-(1-c)G(c) <(xc)G(1)(xc)G(c)(1c)M\displaystyle<(x-c)G(1)-(x-c)G(c)-(1-c)M
G(x)+M\displaystyle G(x)+M <(1x)G(c)+(xc)G(1)1c.\displaystyle<\frac{(1-x)G(c)+(x-c)G(1)}{1-c}.

On the other hand, we have that both G(c)G(c) and G(1)G(1) are less than or equal to G(x)+MG(x)+M, by definition of MM. This means that

G(1),G(c)<(1x)G(c)+(xc)G(1)1cG(1),G(c)<\frac{(1-x)G(c)+(x-c)G(1)}{1-c}

but this implies that G(1)<G(c)G(1)<G(c) and that G(c)<G(1)G(c)<G(1), a contradiction.

Plugging in x=a^hx=\hat{a}_{h} into Equation 12, we find that

D(a^h)a^hc1cD(1)M.D(\hat{a}_{h})\geq\frac{\hat{a}_{h}-c}{1-c}D(1)-M.

Plugging this bound into Equation 11, we get that

B\displaystyle B (a^hc1cD(1)M)(1ahmin)2(a^hahmin)D(1)=1ahmin2(a^hahmin)a^hc1c(1Ma^hc1cD(1))\displaystyle\geq\frac{\left(\frac{\hat{a}_{h}-c}{1-c}D(1)-M\right)(1-a_{\text{hmin}})}{2(\hat{a}_{h}-a_{\text{hmin}})D(1)}=\frac{1-a_{\text{hmin}}}{2(\hat{a}_{h}-a_{\text{hmin}})}\cdot\frac{\hat{a}_{h}-c}{1-c}\left(1-\frac{M}{\frac{\hat{a}_{h}-c}{1-c}D(1)}\right)
1ahmin2(a^hahmin)a^hc1c(1MM+ϵ)ϵ2(M+ϵ),\displaystyle\geq\frac{1-a_{\text{hmin}}}{2(\hat{a}_{h}-a_{\text{hmin}})}\cdot\frac{\hat{a}_{h}-c}{1-c}\left(1-\frac{M}{M+\epsilon^{\prime}}\right)\geq\frac{\epsilon^{\prime}}{2(M+\epsilon^{\prime})},

where in the second-to-last step we use that D(1)1ca^hc(M+ϵ)D(1)\geq\frac{1-c}{\hat{a}_{h}-c}(M+\epsilon^{\prime}) and in the last step we again use the fact that (1ahmin)(a^hc)(a^hahmin)(1c)1\frac{(1-a_{\text{hmin}})(\hat{a}_{h}-c)}{(\hat{a}_{h}-a_{\text{hmin}})(1-c)}\geq 1. ∎

Proof of Lemma 4.19.


We will partition [0,1][0,1] into a number NN of small intervals I1=[x0=0,x1)I_{1}=[x_{0}=0,x_{1}), I2=[x1,x2)I_{2}=[x_{1},x_{2}), I3=[x2,x3)I_{3}=[x_{2},x_{3}), …, IN=[xN1,xN=1]I_{N}=[x_{N-1},x_{N}=1] with certain desirable properties (which we will describe below). For k[N]k\in[N], we will let S(k):={σ𝒮:μσIk}S^{(k)}:=\{\sigma\in\mathcal{S}:\mu_{\sigma}\in I_{k}\}. For a given σ𝒮\sigma\in\mathcal{S}, we will let k(σ)k(\sigma) be the kk such that σS(k)\sigma\in S^{(k)}.

Our goal is to upper bound the expectation of DG(μστμτ)D_{G}(\mu_{\sigma\tau}\parallel\mu_{\tau}). In pursuit of this goal, we observe that by Proposition 4.8 we have

𝔼[DG(μστμτ)]=𝔼[DG(μστμS(k(σ))τ)]+𝔼[DG(μS(k(σ))τμτ)].\mathbb{E}\left[D_{G}(\mu_{\sigma\tau}\parallel\mu_{\tau})\right]=\mathbb{E}\left[D_{G}(\mu_{\sigma\tau}\parallel\mu_{S^{(k(\sigma))}\tau})\right]+\mathbb{E}\left[D_{G}(\mu_{S^{(k(\sigma))}\tau}\parallel\mu_{\tau})\right]. (13)

Now, for any kk, by applying Equation 5 to S=S(k)S=S^{(k)} and T=𝒯T=\mathcal{T}, we know that

𝔼[DG(μσμS(k))S(k)]𝔼[DG(μστμS(k)τ)S(k)].\mathbb{E}\left[D_{G}(\mu_{\sigma}\parallel\mu_{S^{(k)}})\mid S^{(k)}\right]\geq\mathbb{E}\left[D_{G}(\mu_{\sigma\tau}\parallel\mu_{S^{(k)}\tau})\mid S^{(k)}\right].

(Here, “S(k)\mid S^{(k)}” is short for “σS(k)\mid\sigma\in S^{(k)}.”) This is our only use of the rectangle substitutes assumption. Now, taking the expectation over kk (i.e. choosing each kk with probability equal to [σS(k)]\mathbb{P}\left[\sigma\in S^{(k)}\right]), we have that

𝔼[DG(μσμS(k(σ)))]𝔼[DG(μστμS(k(σ))τ)].\mathbb{E}\left[D_{G}(\mu_{\sigma}\parallel\mu_{S^{(k(\sigma))}})\right]\geq\mathbb{E}\left[D_{G}(\mu_{\sigma\tau}\parallel\mu_{S^{(k(\sigma))}\tau})\right].

Together with Equation 13, this tells us that

𝔼[DG(μστμτ)]𝔼[DG(μσμS(k(σ)))]+𝔼[DG(μS(k(σ))τμτ)].\mathbb{E}\left[D_{G}(\mu_{\sigma\tau}\parallel\mu_{\tau})\right]\leq\mathbb{E}\left[D_{G}(\mu_{\sigma}\parallel\mu_{S^{(k(\sigma))}})\right]+\mathbb{E}\left[D_{G}(\mu_{S^{(k(\sigma))}\tau}\parallel\mu_{\tau})\right]. (14)

Our goal will be to bound the two summands in Equation 14. We will specify the boundaries of the intervals I1,,INI_{1},\dots,I_{N} with this goal in mind.

On an intuitive level, we are hoping for two things to be true:

  • In order for the first summand to be small, we want μσ\mu_{\sigma} and μS(k(σ))\mu_{S^{(k(\sigma))}} to be similar in value. In other words, we want each interval to be “short” (for a notion of shortness with respect to GG that we are about to discuss).

  • In order for the second summand to be small, we want μS(k(σ))τ\mu_{S^{(k(\sigma))}\tau} and μτ\mu_{\tau} to be similar in value. In other words, the estimate of a third party who knows τ\tau shouldn’t change much upon learning k(σ)k(\sigma). One way to ensure this is by creating the intervals in a way that makes the third party very confident about the value of k(σ)k(\sigma) before learning it. Intuitively this should be true because Alice and Bob approximately agree, so Alice’s estimate is likely to be close to Bob’s. However, we must be careful to strategically choose the boundaries of our intervals x1,,xN1x_{1},\dots,x_{N-1} so that Alice’s and Bob’s estimates are unlikely to be on opposite sides of a boundary.141414This limits how many intervals we can reasonably use, which is why we cannot make our intervals arbitrarily short to satisfy the first of our two criteria.

What, formally, do we need for the first summand to be small? For any kk, we have μS(k(σ))=𝔼[μσσS(k)]\mu_{S^{(k(\sigma))}}=\mathbb{E}\left[\mu_{\sigma}\mid\sigma\in S^{(k)}\right]. We can apply Proposition C.1 (iv) to the random variable X=μσX=\mu_{\sigma} on the probability subspace given by σS(k)\sigma\in S^{(k)}. Since XX takes on values in IkI_{k}, we have that

𝔼[DG(μσμS(k))S(k)]2JBG(Ik),\mathbb{E}\left[D_{G}(\mu_{\sigma}\parallel\mu_{S^{(k)}})\mid S^{(k)}\right]\leq 2\text{JB}_{G}(I_{k}), (15)

where JBG(Ik)\text{JB}_{G}(I_{k}) is shorthand for the Jensen-Bregman divergence between the endpoints of IkI_{k}. Therefore, if JBG(Ik)\text{JB}_{G}(I_{k}) is small for all kk, then the first summand (which is an expected value of 𝔼[DG(μσμS(k))S(k)]\mathbb{E}\left[D_{G}(\mu_{\sigma}\parallel\mu_{S^{(k)}})\mid S^{(k)}\right] over k[N]k\in[N]) is also small.

What about the second summand? As per the intuition above, we wish to choose our boundary points x1,,xN1x_{1},\dots,x_{N-1} so that Alice’s and Bob’s estimates are unlikely to be on opposite sides of any boundary. Let μ=min(μσ,μτ)\mu_{-}=\min(\mu_{\sigma},\mu_{\tau}) be the smaller of the two estimates and μ+=max(μσ,μτ)\mu_{+}=\max(\mu_{\sigma},\mu_{\tau}) be the larger one. We say that μ,μ+\mu_{-},\mu_{+} thwart a point x(0,1)x\in(0,1) if μxμ+\mu_{-}\leq x\leq\mu_{+} and μμ+\mu_{-}\neq\mu_{+}. We define the thwart density of xx to be

ρ(x):=[μ,μ+ thwart x].\rho(x):=\mathbb{P}\left[\mu_{-},\mu_{+}\text{ thwart }x\right].

Roughly speaking, we will choose x1,,xN1x_{1},\dots,x_{N-1} such that ρ(xk)\rho(x_{k}) is small on average.

We will approach this problem by first creating intervals to satisfy the first criterion (short intervals), without regard to the second, and then modifying them to satisfy the second without compromising the first. Formally, we choose our intervals according to the following algorithm.

Algorithm C.3 (Partitioning [0,1][0,1] into intervals I1,,INI_{1},\dots,I_{N}).

  1. (1)

    Choose points 0<x1<x2<<xN2<10<x_{1}^{\prime}<x_{2}^{\prime}<\dots<x_{N-2}^{\prime}<1 such that the N1N-1 intervals thus created all have Jensen-Bregman divergence between β\beta and 2βc\frac{2\beta}{c}, inclusive, where β\beta and cc are as in the statement of Lemma 4.19. (NN is not pre-determined; it is defined as one more than the number of intervals created.) (See footnote for why this is possible.151515Define x1x_{1}^{\prime} so that JBG(0,x1)=2βc\text{JB}_{G}(0,x_{1}^{\prime})=\frac{2\beta}{c} (this is possible because JBG\text{JB}_{G} is continuous in its arguments). Define x2x_{2}^{\prime} so that JBG(x1,x2)=2βc\text{JB}_{G}(x_{1}^{\prime},x_{2}^{\prime})=\frac{2\beta}{c}. Keep going until an endpoint xN3x_{N-3}^{\prime} is defined such that adding xN2x_{N-2}^{\prime} as before would leave an interval (xN2,1)(x_{N-2}^{\prime},1) with Jensen-Bregman divergence less than 2βc\frac{2\beta}{c}. Now, instead of defining xN2x_{N-2}^{\prime} in this way, define it so that JBG(xN3,xN2)=JBG(xN2,1)\text{JB}_{G}(x_{N-3}^{\prime},x_{N-2}^{\prime})=\text{JB}_{G}(x_{N-2}^{\prime},1). Since JBG(xN3,1)2βc\text{JB}_{G}(x_{N-3}^{\prime},1)\geq\frac{2\beta}{c}, the cc-approximate triangle inequality that we have by assumption tells us that JBG(xN3,xN2)=JBG(xN2,1)β\text{JB}_{G}(x_{N-3}^{\prime},x_{N-2}^{\prime})=\text{JB}_{G}(x_{N-2}^{\prime},1)\geq\beta.)

  2. (2)

    Let x0:=0,xN1:=1x_{0}^{\prime}:=0,x_{N-1}^{\prime}:=1 for convenience. Define Ik:=[xk1,xk]I_{k}^{\prime}:=[x_{k-1}^{\prime},x_{k}^{\prime}]. For k[N1]k\in[N-1], let αk:=infxIkρ(x)\alpha_{k}:=\inf_{x\in I_{k}^{\prime}}\rho(x). Let xkIkx_{k}\in I_{k}^{\prime} be such161616If the infimum is achieved (e.g. if the space of signals to Alice and Bob is finite), then we can set xk:=argminxρ(x)x_{k}:=\arg\min_{x}\rho(x). Our algorithm works in more generality, at the expense of a factor of 22 in our final bound. Note that replacing 22 with a smaller constant can arbitrarily reduce this factor. that ρ(xk)2αk\rho(x_{k})\leq 2\alpha_{k}.

  3. (3)

    Return the intervals I1=[0,x1),I2=[x1,x2),,IN=[xN1,1]I_{1}=[0,x_{1}),I_{2}=[x_{1},x_{2}),\dots,I_{N}=[x_{N-1},1].
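
The sketch below is one way Algorithm C.3 could be implemented for a finite information structure, assumed to be given as a list of (probability, μσ, μτ) triples (a hypothetical input format); G is taken to be the negative Shannon entropy for concreteness, the bisections rely on JB_G being monotone in each endpoint, the infimum in step (2) is approximated by a grid search, and β is assumed small enough that JB_G(0,1) ≥ 2β/c.

```python
import numpy as np

def G(x):
    x = np.clip(x, 1e-12, 1 - 1e-12)   # example G: negative Shannon entropy
    return x * np.log(x) + (1 - x) * np.log(1 - x)

def JB(a, b):                          # Jensen-Bregman divergence of G
    return (G(a) + G(b)) / 2 - G((a + b) / 2)

def solve(f, lo, hi, steps=80):
    """Bisection for the root of an increasing function f with f(lo) <= 0 <= f(hi)."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return hi

def algorithm_c3(pairs, beta, c):
    """pairs: list of (prob, mu_sigma, mu_tau) triples. Returns the cut points x_1..x_{N-1}."""
    # Step (1): provisional cuts whose intervals have JB-length exactly 2*beta/c; once the
    # remainder is too short, re-place the last cut so the final two intervals have equal
    # JB-length (each then has JB-length at least beta, by the approximate triangle inequality).
    target = 2 * beta / c
    prov = [0.0]
    while JB(prov[-1], 1.0) >= target:
        nxt = solve(lambda z, left=prov[-1]: JB(left, z) - target, prov[-1], 1.0)
        if JB(nxt, 1.0) < target:
            break
        prov.append(nxt)
    last = solve(lambda z, left=prov[-1]: JB(left, z) - JB(z, 1.0), prov[-1], 1.0)
    prov += [last, 1.0]

    # Step (2): within each provisional interval, move the cut to (approximately) minimize
    # the thwart density rho(x) = P[mu_- <= x <= mu_+ and mu_- != mu_+].
    def rho(x):
        return sum(p for p, a, b in pairs if min(a, b) <= x <= max(a, b) and a != b)

    cuts = [min(np.linspace(lo, hi, 201)[1:-1], key=rho)
            for lo, hi in zip(prov[:-1], prov[1:])]

    # Step (3): the cuts, together with 0 and 1, define the intervals I_1, ..., I_N.
    return sorted(cuts)
```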

We begin by observing that for any k[N]k\in[N], we have

JBG(Ik)=JBG(xk1,xk)JBG(xk2,xk)1c(JBG(xk2,xk1)+JBG(xk1,xk))4βc2\text{JB}_{G}(I_{k})=\text{JB}_{G}(x_{k-1},x_{k})\leq\text{JB}_{G}(x_{k-2}^{\prime},x_{k}^{\prime})\leq\frac{1}{c}(\text{JB}_{G}(x_{k-2}^{\prime},x_{k-1}^{\prime})+\text{JB}_{G}(x_{k-1}^{\prime},x_{k}^{\prime}))\leq\frac{4\beta}{c^{2}}

where for convenience we define x1:=0,xN:=1x_{-1}^{\prime}:=0,x_{N}^{\prime}:=1. Therefore, by Equation 15, we have

𝔼[DG(μσμS(k(σ)))]8βc2.\mathbb{E}\left[D_{G}(\mu_{\sigma}\parallel\mu_{S^{(k(\sigma))}})\right]\leq\frac{8\beta}{c^{2}}. (16)

It remains to bound the second summand of Equation 14, 𝔼[DG(μS(k(σ))τμτ)]\mathbb{E}\left[D_{G}(\mu_{S^{(k(\sigma))}\tau}\parallel\mu_{\tau})\right], which is the bulk of the proof. We proceed in two steps:

  1. (1)

    (Lemma C.4) We show that k=1Nαk\sum_{k=1}^{N}\alpha_{k} is small. This means that Alice’s and Bob’s estimates are unlikely to lie on opposite sides of some boundary point xkx_{k}. As a consequence, Bob is highly likely to know k(σ)k(\sigma) with a lot of confidence.

  2. (2)

    (Lemma C.7) We bound the second summand as a function of k=1Nαk\sum_{k=1}^{N}\alpha_{k}. The intuition is that if kαk\sum_{k}\alpha_{k} is small, then Bob is highly likely to know k(σ)k(\sigma) with a lot of confidence, which means that he does not learn too much from learning k(σ)k(\sigma).

We begin with the first step; recall our notation μ:=min(μσ,μτ)\mu^{-}:=\min(\mu_{\sigma},\mu_{\tau}) and μ+:=max(μσ,μτ)\mu^{+}:=\max(\mu_{\sigma},\mu_{\tau}).

Lemma C.4.
2k=1Nαk4(ϵβc)1/(1log2c).2\sum_{k=1}^{N}\alpha_{k}\leq 4\left(\frac{\epsilon}{\beta c}\right)^{1/(1-\log_{2}c)}.
Proof.

We use the following claim, whose proof we provide afterward.

Claim C.5.

Let I=[x,x+]I=[x^{-},x^{+}] be any sub-interval of [0,1][0,1] and let α=infxIρ(x)\alpha=\inf_{x\in I}\rho(x). Then there is an increasing sequence of points z0:=x,z1,z2,,zL1,zL:=x+z_{0}:=x^{-},z_{1},z_{2},\dots,z_{L-1},z_{L}:=x^{+}, such that for every [L]\ell\in[L], [μz1,μ+z]α2\mathbb{P}\left[\mu_{-}\leq z_{\ell-1},\mu_{+}\geq z_{\ell}\right]\geq\frac{\alpha}{2}, and where

L2α[L][μz1<μ+z].L\leq\frac{2}{\alpha}\sum_{\ell\in[L]}\mathbb{P}\left[\mu_{-}\leq z_{\ell-1}<\mu_{+}\leq z_{\ell}\right].

We apply Claim C.5 to the intervals I1,,IN1I_{1}^{\prime},\dots,I_{N-1}^{\prime}, with α=αk\alpha=\alpha_{k}. Let zk,0,,zk,Lkz_{k,0},\dots,z_{k,L_{k}} be the points whose existence the claim proves, and let rk:=[Lk][μzk,1<μ+zk,]r_{k}:=\sum_{\ell\in[L_{k}]}\mathbb{P}\left[\mu_{-}\leq z_{k,\ell-1}<\mu_{+}\leq z_{k,\ell}\right], so that Lk2αkrkL_{k}\leq\frac{2}{\alpha_{k}}r_{k}. Observe that krk1\sum_{k}r_{k}\leq 1, because the intervals (zk,1,zk,](z_{k,\ell-1},z_{k,\ell}] are disjoint for all k,k,\ell. We make the following claim (we provide the proof afterward).

Claim C.6.
k[N1]rk(αk2rk)1log2cϵβc.\sum_{k\in[N-1]}r_{k}\left(\frac{\alpha_{k}}{2r_{k}}\right)^{1-\log_{2}c}\leq\frac{\epsilon}{\beta c}. (17)

We may rewrite Equation 17 as

(k[N1]rk(αk2rk)1log2c)1/(1log2c)(ϵβc)1/(1log2c).\left(\sum_{k\in[N-1]}r_{k}\left(\frac{\alpha_{k}}{2r_{k}}\right)^{1-\log_{2}c}\right)^{1/(1-\log_{2}c)}\leq\left(\frac{\epsilon}{\beta c}\right)^{1/(1-\log_{2}c)}.

Recall that krk1\sum_{k}r_{k}\leq 1. Scaling the rkr_{k}’s to add to 11 decreases the left-hand side above, so we may assume that krk=1\sum_{k}r_{k}=1. Note that x1log2cx^{1-\log_{2}c} is convex. Thus, by using a weighted Jensen inequality on the left-hand side with weights rkr_{k}, we find that

12kαk=krkαk2rk(k[N1]rk(αk2rk)1log2c)1/(1log2c)(ϵβc)1/(1log2c).\frac{1}{2}\sum_{k}\alpha_{k}=\sum_{k}r_{k}\cdot\frac{\alpha_{k}}{2r_{k}}\leq\left(\sum_{k\in[N-1]}r_{k}\left(\frac{\alpha_{k}}{2r_{k}}\right)^{1-\log_{2}c}\right)^{1/(1-\log_{2}c)}\leq\left(\frac{\epsilon}{\beta c}\right)^{1/(1-\log_{2}c)}.

This completes the proof of Lemma C.4. ∎

Proof of Claim C.5.

Let z1=inf{z:[μz0<μ+z]α2}z_{1}=\inf\{z:\mathbb{P}\left[\mu_{-}\leq z_{0}<\mu_{+}\leq z\right]\geq\frac{\alpha}{2}\}, or x+x^{+} if this number does not exist or is larger than x+x^{+}. Note that [μz0<μ+]α\mathbb{P}\left[\mu_{-}\leq z_{0}<\mu_{+}\right]\geq\alpha, as we have ρ(z0)=[μz0<μ+]+[μ<z0=μ+]α\rho(z_{0})=\mathbb{P}\left[\mu_{-}\leq z_{0}<\mu_{+}\right]+\mathbb{P}\left[\mu_{-}<z_{0}=\mu_{+}\right]\geq\alpha, so if the first term were less than α\alpha we would have some z>z0z^{\prime}>z_{0} with ρ(z)<α\rho(z^{\prime})<\alpha. On the other hand, [μz0<μ+<z1]α2\mathbb{P}\left[\mu_{-}\leq z_{0}<\mu_{+}<z_{1}\right]\leq\frac{\alpha}{2}, since

[μz0<μ+<z1]=limzz1 from below[μz0<μ+z]\mathbb{P}\left[\mu_{-}\leq z_{0}<\mu_{+}<z_{1}\right]=\lim_{z\to z_{1}\text{ from below}}\mathbb{P}\left[\mu_{-}\leq z_{0}<\mu_{+}\leq z\right]

and if the right-hand side were more than α2\frac{\alpha}{2} then that would contradict the definition of z1z_{1} as an infimum. Therefore, [μz0,μ+z1]α2\mathbb{P}\left[\mu_{-}\leq z_{0},\mu_{+}\geq z_{1}\right]\geq\frac{\alpha}{2}.

If z1=x+z_{1}=x^{+}, we are done. Otherwise, let z2=inf{z:[μz1<μ+z]α2}z_{2}=\inf\{z:\mathbb{P}\left[\mu_{-}\leq z_{1}<\mu_{+}\leq z\right]\geq\frac{\alpha}{2}\}. Then [μz1,μ+z2]α2\mathbb{P}\left[\mu_{-}\leq z_{1},\mu_{+}\geq z_{2}\right]\geq\frac{\alpha}{2}. Define z3z_{3} analogously, and so forth.

All that remains to show is the upper bound on LL. This is where we use the fact that (by construction) [μz1<μ+z]α2\mathbb{P}\left[\mu_{-}\leq z_{\ell-1}<\mu_{+}\leq z_{\ell}\right]\geq\frac{\alpha}{2}. Summing over all \ell, we have

[L][μz1<μ+z]α2L,\sum_{\ell\in[L]}\mathbb{P}\left[\mu_{-}\leq z_{\ell-1}<\mu_{+}\leq z_{\ell}\right]\geq\frac{\alpha}{2}L,

which (after rearranging) completes the proof. ∎

Proof of Claim C.6.

First note that by construction, JBG(Ik)β\text{JB}_{G}(I_{k}^{\prime})\geq\beta for all kk. By repeated use of the cc-approximate triangle inequality,171717We sub-divide IkI_{k}^{\prime} into [zk,0,zk,Lk/2][z_{k,0},z_{k,L_{k}/2}] and [zk,Lk/2,zk,L][z_{k,L_{k}/2},z_{k,L}], then subdivide each of these, and so on. we find that

[Lk]JBG(zk,1,zk,)clog2LkJBG(Ik)c1+log22rkαkJBG(Ik)c1+log22rkαkβ=c(2rkαk)log2cβ.\sum_{\ell\in[L_{k}]}\text{JB}_{G}(z_{k,\ell-1},z_{k,\ell})\geq c^{\left\lceil\log_{2}L_{k}\right\rceil}\text{JB}_{G}(I_{k}^{\prime})\geq c^{1+\log_{2}\frac{2r_{k}}{\alpha_{k}}}\text{JB}_{G}(I_{k}^{\prime})\geq c^{1+\log_{2}\frac{2r_{k}}{\alpha_{k}}}\beta=c\left(\frac{2r_{k}}{\alpha_{k}}\right)^{\log_{2}c}\beta.

On the other hand, we have

ϵ\displaystyle\epsilon 𝔼[JBG(μσ,μτ)]=σ,τ[σ,τ]JBG(μσ,μτ)σ,τ[σ,τ]k,:μzk,1μ+zk,JBG(zk,1,zk,)\displaystyle\geq\mathbb{E}\left[\text{JB}_{G}(\mu_{\sigma},\mu_{\tau})\right]=\sum_{\sigma,\tau}\mathbb{P}\left[\sigma,\tau\right]\text{JB}_{G}(\mu_{\sigma},\mu_{\tau})\geq\sum_{\sigma,\tau}\mathbb{P}\left[\sigma,\tau\right]\sum_{\begin{subarray}{c}k,\ell:\mu_{-}\leq z_{k,\ell-1}\\ \mu_{+}\geq z_{k,\ell}\end{subarray}}\text{JB}_{G}(z_{k,\ell-1},z_{k,\ell})
=k,[μzk,1,μ+zk,]JBG(zk,1,zk,)k,αk2JBG(zk,1,zk,).\displaystyle=\sum_{k,\ell}\mathbb{P}\left[\mu_{-}\leq z_{k,\ell-1},\mu_{+}\geq z_{k,\ell}\right]\text{JB}_{G}(z_{k,\ell-1},z_{k,\ell})\geq\sum_{k,\ell}\frac{\alpha_{k}}{2}\text{JB}_{G}(z_{k,\ell-1},z_{k,\ell}).

Here, the third step follows by the reverse triangle inequality (Fact (ii) of Proposition C.1) and the fourth step follows by rearranging the order of summation.181818The case that the space of signals is infinite is identical except that the summation is replaced by an integral over the probability space. Combining the last two facts gives us that

ϵkαk2c(2rkαk)log2cβ=krk(2rkαk)log2c1βc,\epsilon\geq\sum_{k}\frac{\alpha_{k}}{2}\cdot c\left(\frac{2r_{k}}{\alpha_{k}}\right)^{\log_{2}c}\beta=\sum_{k}r_{k}\left(\frac{2r_{k}}{\alpha_{k}}\right)^{\log_{2}c-1}\beta c,

which rearranges to the desired inequality. ∎

We are now ready to bound the second summand, i.e. 𝔼[DG(μS(k(σ))τμτ)]\mathbb{E}\left[D_{G}(\mu_{S^{(k(\sigma))}\tau}\parallel\mu_{\tau})\right], where k(σ)k(\sigma) is the kk such that Alice’s estimate μσ\mu_{\sigma} lies in IkI_{k}. For convenience we will define k(τ)k(\tau) for Bob by analogy as the kk such that μτ\mu_{\tau} lies in IkI_{k}. By Lemma C.4 and the preceding discussion, we know that

[k(σ)k(τ)]4(ϵβc)1/(1log2c).\mathbb{P}\left[k(\sigma)\neq k(\tau)\right]\leq 4\left(\frac{\epsilon}{\beta c}\right)^{1/(1-\log_{2}c)}.
Lemma C.7.

Let Q=[k(σ)k(τ)]Q=\mathbb{P}\left[k(\sigma)\neq k(\tau)\right]. Then

𝔼[DG(μS(k(σ))τμτ)]2G~(Q).\mathbb{E}\left[D_{G}(\mu_{S^{(k(\sigma))}\tau}\parallel\mu_{\tau})\right]\leq 2\tilde{G}^{*}(Q).

The key idea is that because k(σ)=k(τ)k(\sigma)=k(\tau) with probability near 11, learning k(σ)k(\sigma) is unlikely to make Bob update his estimate much.

Proof.

Consider any signal τ^𝒯\hat{\tau}\in\mathcal{T} and let p(τ^)=[τ=τ^]p(\hat{\tau})=\mathbb{P}\left[\tau=\hat{\tau}\right]. We have191919This proof takes sums over τ^𝒯\hat{\tau}\in\mathcal{T} and thus implicitly assumes that 𝒯\mathcal{T} is finite, but the proof extends to infinite 𝒯\mathcal{T}, with sums over τ\tau replaced by integrals with respect to the probability measure over 𝒯\mathcal{T}.

𝔼[DG(μS(k(σ))τμτ)]=τ^𝒯p(τ^)𝔼[DG(μS(k(σ))τ^μτ^)τ=τ^].\mathbb{E}\left[D_{G}(\mu_{S^{(k(\sigma))}\tau}\parallel\mu_{\tau})\right]=\sum_{\hat{\tau}\in\mathcal{T}}p(\hat{\tau})\mathbb{E}\left[D_{G}(\mu_{S^{(k(\sigma))}\hat{\tau}}\parallel\mu_{\hat{\tau}})\mid\tau=\hat{\tau}\right].

Note that μτ^=𝔼[μS(k(σ))τ^τ=τ^]\mu_{\hat{\tau}}=\mathbb{E}\left[\mu_{S^{(k(\sigma))}\hat{\tau}}\mid\tau=\hat{\tau}\right], so by the equality in Proposition C.1 (iv) we have that

𝔼[DG(μS(k(σ))τμτ)]=τ^𝒯p(τ^)(𝔼[G(μS(k(σ))τ^)τ=τ^]G(μτ^)).\mathbb{E}\left[D_{G}(\mu_{S^{(k(\sigma))}\tau}\parallel\mu_{\tau})\right]=\sum_{\hat{\tau}\in\mathcal{T}}p(\hat{\tau})\left(\mathbb{E}\left[G(\mu_{S^{(k(\sigma))}\hat{\tau}})\mid\tau=\hat{\tau}\right]-G(\mu_{\hat{\tau}})\right).

Let q(τ^)=[τ=τ^,k(σ)k(τ^)]q(\hat{\tau})=\mathbb{P}\left[\tau=\hat{\tau},k(\sigma)\neq k(\hat{\tau})\right], so τ^𝒯q(τ^)=Q\sum_{\hat{\tau}\in\mathcal{T}}q(\hat{\tau})=Q. Then

𝔼[G(μS(k(σ))τ^)τ=τ^]G(μτ^)=\displaystyle\mathbb{E}\left[G(\mu_{S^{(k(\sigma))}\hat{\tau}})\mid\tau=\hat{\tau}\right]-G(\mu_{\hat{\tau}})=
p(τ^)q(τ^)p(τ^)(𝔼[G(μS(k(τ^))τ^)G(μτ^)])+q(τ^)p(τ^)(𝔼[G(μS(k(σ))τ^)τ=τ^,k(σ)k(τ^)]G(μτ^)).\displaystyle\frac{p(\hat{\tau})-q(\hat{\tau})}{p(\hat{\tau})}\left(\mathbb{E}\left[G(\mu_{S^{(k(\hat{\tau}))}\hat{\tau}})-G(\mu_{\hat{\tau}})\right]\right)+\frac{q(\hat{\tau})}{p(\hat{\tau})}\left(\mathbb{E}\left[G(\mu_{S^{(k(\sigma))}\hat{\tau}})\mid\tau=\hat{\tau},k(\sigma)\neq k(\hat{\tau})\right]-G(\mu_{\hat{\tau}})\right).

The second term is at most q(τ^)p(τ^)M\frac{q(\hat{\tau})}{p(\hat{\tau})}M, since MM is the range of GG. To bound the first term, we note that μS(k(τ^))τ^\mu_{S^{(k(\hat{\tau}))}\hat{\tau}} cannot differ from μτ^\mu_{\hat{\tau}} by more than q(τ^)p(τ^)q(τ^)\frac{q(\hat{\tau})}{p(\hat{\tau})-q(\hat{\tau})}, as otherwise the average value of μS(k(σ))τ^\mu_{S^{(k(\sigma))}\hat{\tau}} could not be μτ^\mu_{\hat{\tau}}. Therefore, 𝔼[G(μS(k(τ^))τ^)G(μτ^)]\mathbb{E}\left[G(\mu_{S^{(k(\hat{\tau}))}\hat{\tau}})-G(\mu_{\hat{\tau}})\right] is bounded by the largest possible difference in GG-values of two points that differ by at most q(τ^)p(τ^)q(τ^)\frac{q(\hat{\tau})}{p(\hat{\tau})-q(\hat{\tau})}. Therefore, we have

𝔼[DG(μS(k(σ))τμτ)]\displaystyle\mathbb{E}\left[D_{G}(\mu_{S^{(k(\sigma))}\tau}\parallel\mu_{\tau})\right] τ^𝒯p(τ^)(p(τ^)q(τ^)p(τ^)G~(q(τ^)p(τ^)q(τ^))+q(τ^)p(τ^)M)\displaystyle\leq\sum_{\hat{\tau}\in\mathcal{T}}p(\hat{\tau})\left(\frac{p(\hat{\tau})-q(\hat{\tau})}{p(\hat{\tau})}\tilde{G}\left(\frac{q(\hat{\tau})}{p(\hat{\tau})-q(\hat{\tau})}\right)+\frac{q(\hat{\tau})}{p(\hat{\tau})}M\right)
QM+τ^𝒯(p(τ^)q(τ^))G~(q(τ^)p(τ^)q(τ^)),\displaystyle\leq QM+\sum_{\hat{\tau}\in\mathcal{T}}(p(\hat{\tau})-q(\hat{\tau}))\tilde{G}\left(\frac{q(\hat{\tau})}{p(\hat{\tau})-q(\hat{\tau})}\right),

where G~\tilde{G} is defined as in the statement of Lemma C.7. If GG is symmetric on [0,1][0,1], then G~(x)=G(0)G(x)\tilde{G}(x)=G(0)-G(x) for x12x\leq\frac{1}{2} and MM otherwise. This is a concave function, but G~\tilde{G} is not in general concave. However, consider G~\tilde{G}^{*} as defined in the lemma statement, so G~(x)G~(x)\tilde{G}(x)\leq\tilde{G}^{*}(x) for all xx. Then

𝔼[DG(μS(k(σ))τμτ)]\displaystyle\mathbb{E}\left[D_{G}(\mu_{S^{(k(\sigma))}\tau}\parallel\mu_{\tau})\right] QM+τ^𝒯(p(τ^)q(τ^))G~(q(τ^)p(τ^)q(τ^))\displaystyle\leq QM+\sum_{\hat{\tau}\in\mathcal{T}}(p(\hat{\tau})-q(\hat{\tau}))\tilde{G}^{*}\left(\frac{q(\hat{\tau})}{p(\hat{\tau})-q(\hat{\tau})}\right)
QM+(τ^𝒯(p(τ^)q(τ^)))G~(τ^𝒯q(τ^)τ^𝒯(p(τ^)q(τ^)))\displaystyle\leq QM+\left(\sum_{\hat{\tau}\in\mathcal{T}}(p(\hat{\tau})-q(\hat{\tau}))\right)\cdot\tilde{G}^{*}\left(\frac{\sum_{\hat{\tau}\in\mathcal{T}}q(\hat{\tau})}{\sum_{\hat{\tau}\in\mathcal{T}}(p(\hat{\tau})-q(\hat{\tau}))}\right)
=QM+(1Q)G~(Q1Q)QM+G~(Q)2G~(Q).\displaystyle=QM+(1-Q)\tilde{G}^{*}\left(\frac{Q}{1-Q}\right)\leq QM+\tilde{G}^{*}(Q)\leq 2\tilde{G}^{*}(Q).

Here, the second step follows by Jensen’s inequality applied to the concave function G~\tilde{G}^{*}, with terms q(τ^)p(τ^)q(τ^)\frac{q(\hat{\tau})}{p(\hat{\tau})-q(\hat{\tau})} and weights p(τ^)q(τ^)p(\hat{\tau})-q(\hat{\tau}); the second-to-last step follows from the fact that G~\tilde{G}^{*} is concave and G~(0)=0\tilde{G}^{*}(0)=0, and the last step follows from the fact that G~\tilde{G}^{*} is concave and G~(1)=M\tilde{G}^{*}(1)=M.

Since Q4(ϵβc)1/(1log2c)Q\leq 4\left(\frac{\epsilon}{\beta c}\right)^{1/(1-\log_{2}c)}, combining Lemma C.7 with Equation 16 gives us the following result.

𝔼[DG(μστμτ)]8βc2+2G~(4(ϵβc)1/(1log2c)).\mathbb{E}\left[D_{G}(\mu_{\sigma\tau}\parallel\mu_{\tau})\right]\leq\frac{8\beta}{c^{2}}+2\tilde{G}^{*}\left(4\left(\frac{\epsilon}{\beta c}\right)^{1/(1-\log_{2}c)}\right).

Noting that G~\tilde{G}^{*} is concave and c1/(1log2c)2c^{-1/(1-\log_{2}c)}\leq 2 (which is true for all 0<c<10<c<1) completes the proof of Lemma 4.19. ∎

Appendix D Alternative Definitions of Agreement and Accuracy

For arbitrary Bregman divergences, there are several notions of agreement and accuracy that are worth considering. Before we discuss these, we make a note about the order of arguments in a Bregman divergence. In our context, it makes the most sense to talk of the Bregman divergence from a more informed estimate to a less informed estimate. By a “more informed estimate” we mean a finer-grained one, i.e. one that is informed by more knowledge. For example, when estimating YY in the setting that this work explores, YY itself is more informed than μστ\mu_{\sigma\tau}, which is more informed than μσ\mu_{\sigma} and μSτ\mu_{S\tau}, which are each more informed than μST\mu_{ST}, which is more informed than 𝔼[Y]\mathbb{E}\left[Y\right].

To see that this is the natural order of the arguments, recall that Bregman divergences are motivated by the property that they elicit the mean (see Proposition 4.2): if an agent who gives an estimate of xx for the value of a random variable YY incurs a loss of DG(Yx)D_{G}(Y\parallel x), then the agent minimizes their expected loss by reporting x=𝔼[Y]x=\mathbb{E}\left[Y\right]. This means that the expert ought to report the expected value of YY given the information that the expert knows.

This means that given two estimates of YY, Z1Z_{1} and Z2Z_{2}, of which Z1Z_{1} is more informed, the quantity DG(Z1Z2)D_{G}(Z_{1}\parallel Z_{2}) has a natural interpretation: it is the expected amount the expert gains by learning more and refining their estimate from Z2Z_{2} to Z1Z_{1}. This follows by the Pythagorean theorem:

𝔼[DG(Z1Z2)]=𝔼[DG(YZ2)]𝔼[DG(YZ1)].\mathbb{E}\left[D_{G}(Z_{1}\parallel Z_{2})\right]=\mathbb{E}\left[D_{G}(Y\parallel Z_{2})\right]-\mathbb{E}\left[D_{G}(Y\parallel Z_{1})\right].
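
As a small illustration of the elicitation property, the following sketch takes G to be the negative Shannon entropy (so that D_G(Y∥x) is the log loss of the report x when Y ∈ {0,1}) and confirms by grid search that the expected divergence from Y ~ Bernoulli(0.3) is minimized by reporting the mean, x = 0.3.

```python
import numpy as np

def G(y):
    y = np.clip(y, 1e-12, 1 - 1e-12)            # negative Shannon entropy
    return y * np.log(y) + (1 - y) * np.log(1 - y)

def D(y, x):                                    # Bregman divergence D_G(y || x)
    return G(y) - G(x) - (y - x) * (np.log(x) - np.log(1 - x))

p = 0.3                                         # Y ~ Bernoulli(p)
reports = np.linspace(0.01, 0.99, 981)
loss = p * D(1.0, reports) + (1 - p) * D(0.0, reports)
print(reports[np.argmin(loss)])                 # ~0.30: the conditional mean minimizes the loss
```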

D.1 Alternative Definitions of Agreement

One important motivation for using the Jensen-Bregman divergence to the midpoint as the definition of agreement is that this quantity serves as a lower bound on the expected amount that Charlie disagrees with Alice and Bob. Formally:

Definition D.1.

Let aa, bb, and cc be Alice’s, Bob’s, and Charlie’s expectations, respectively (these are random variables on Ω\Omega). Alice and Bob ϵ\epsilon-agree with Charlie if 12(𝔼[DG(ac)+DG(bc)])ϵ\frac{1}{2}(\mathbb{E}\left[D_{G}(a\parallel c)+D_{G}(b\parallel c)\right])\leq\epsilon.

(This is the order of arguments because Alice and Bob are more informed than Charlie.) By Proposition C.1 (i), we know that if Alice and Bob ϵ\epsilon-agree with Charlie then they ϵ\epsilon-agree.

As it happens, the fact that agreement under this (stronger) definition implies accuracy under rectangle substitutes follows immediately:

Proposition D.2.

Let =(Ω,,𝒮,𝒯,Y)\mathcal{I}=(\Omega,\mathbb{P},\mathcal{S},\mathcal{T},Y) be an information structure that satisfies rectangle substitutes. For any communication protocol that causes Alice and Bob to ϵ\epsilon-agree with Charlie on \mathcal{I}, Alice and Bob are 2ϵ2\epsilon-accurate after the protocol terminates.

Proof.

Let SS be the set of possible signals of Alice at the end of the protocol which are consistent with the protocol transcript, and define TT likewise for Bob. Recall that Charlie’s expectation is μST\mu_{ST}. We have

𝔼[DG(μστμSτ)]𝔼[DG(μσTμST)]𝔼[DG(μσTμST)]+𝔼[DG(μSτμST)]2ϵ,\mathbb{E}\left[D_{G}(\mu_{\sigma\tau}\parallel\mu_{S\tau})\right]\leq\mathbb{E}\left[D_{G}(\mu_{\sigma T}\parallel\mu_{ST})\right]\leq\mathbb{E}\left[D_{G}(\mu_{\sigma T}\parallel\mu_{ST})\right]+\mathbb{E}\left[D_{G}(\mu_{S\tau}\parallel\mu_{ST})\right]\leq 2\epsilon,

where the first inequality follows by rectangle substitutes and the last inequality follows because Alice and Bob ϵ\epsilon-agree with Charlie. ∎

The drawback of Definition D.1 is that it is not so much a definition of Alice and Bob’s agreement with each other as a definition of agreement with respect to the protocol being run (since Charlie only exists within the context of the protocol). Put otherwise, it is impossible to determine whether Alice and Bob ϵ\epsilon-agree with Charlie simply by knowing Alice’s and Bob’s expectations; one must also know Charlie’s expectation, which cannot be determined from theirs. The question “how far from agreement are Alice and Bob if Alice believes 25% and Bob believes 30%?” makes sense in the context of ϵ\epsilon-agreement, but not in the context of ϵ\epsilon-agreement with Charlie.

A different notion of agreement, which (like ϵ\epsilon-agreement) only depends on Alice’s and Bob’s expectations, uses the symmetrized Bregman divergence between these expectations: 12(DG(ab)+DG(ba))\frac{1}{2}(D_{G}(a\parallel b)+D_{G}(b\parallel a)).

Definition D.3.

Let aa and bb be Alice’s and Bob’s expectations, respectively (these are random variables on Ω\Omega). Alice and Bob satisfy symmetrized ϵ\epsilon-agreement if 12𝔼[DG(ab)+DG(ba)]ϵ\frac{1}{2}\mathbb{E}\left[D_{G}(a\parallel b)+D_{G}(b\parallel a)\right]\leq\epsilon.

By Proposition C.1 (iii), we know that if Alice and Bob satisfy symmetrized ϵ\epsilon-agreement then they ϵ\epsilon-agree.

In our context, symmetrized Bregman divergence is less natural than Jensen-Bregman divergence. This is because symmetrized Bregman divergence (unlike Jensen-Bregman divergence) does not seem to relate closely to our earlier discussion of the Bregman divergence from a more informed to a less informed estimate being the most natural quantity.
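As a quick numerical sanity check of the relations referenced above (both ϵ\epsilon-agreement with Charlie and symmetrized ϵ\epsilon-agreement imply ϵ\epsilon-agreement), here is a small Python sketch—our own illustration with hypothetical variable names—comparing the Jensen-Bregman divergence to the midpoint against both quantities, using the binary-KL Bregman divergence on random expectations:

import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    # Bregman divergence for G(x) = x log x + (1-x) log(1-x): binary KL divergence
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

a = rng.uniform(0.01, 0.99, size=100_000)   # Alice's expectation
b = rng.uniform(0.01, 0.99, size=100_000)   # Bob's expectation
c = rng.uniform(0.01, 0.99, size=100_000)   # Charlie's expectation (arbitrary)
m = (a + b) / 2                             # the midpoint

jb = 0.5 * (kl(a, m) + kl(b, m))            # Jensen-Bregman divergence to the midpoint
with_charlie = 0.5 * (kl(a, c) + kl(b, c))  # quantity in Definition D.1
symmetrized = 0.5 * (kl(a, b) + kl(b, a))   # quantity in Definition D.3

print(bool(np.all(jb <= with_charlie + 1e-9)))  # expected: True
print(bool(np.all(jb <= symmetrized + 1e-9)))   # expected: True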

D.2 Alternative Notions of Accuracy

Our definition of Alice’s accuracy as the expected Bregman divergence from the truth μστ\mu_{\sigma\tau} to Alice’s expectation seems like the most natural one. However, one may desire a definition of accuracy that takes both Alice’s and Bob’s expectations into account, judging the pair’s accuracy based on their consensus belief, rather than each of their individual beliefs. For instance, one could say that Alice and Bob are ϵ\epsilon-midpoint-accurate if 𝔼[DG(μστa+b2)]ϵ\mathbb{E}\left[D_{G}\left(\mu_{\sigma\tau}\parallel\frac{a+b}{2}\right)\right]\leq\epsilon. By this definition, Alice’s and Bob’s expectations could individually be far from the truth, but they are considered accurate because the average of their expectations is close to correct.

Proposition D.4.

If Alice and Bob are ϵ\epsilon-accurate, then they are 2ϵ2\epsilon-midpoint-accurate.

Proof.

Observe that for all a,b,ya,b,y it is the case that

DG(ya+b2)max(DG(ya),DG(yb))DG(ya)+DG(yb).D_{G}\left(y\parallel\frac{a+b}{2}\right)\leq\max(D_{G}(y\parallel a),D_{G}(y\parallel b))\leq D_{G}(y\parallel a)+D_{G}(y\parallel b).

The first inequality is true simply because a+b2\frac{a+b}{2} lies in between aa and bb: as a function of its second argument, DG(y)D_{G}(y\parallel\cdot) is decreasing below yy and increasing above yy, so it is maximized at an endpoint of the interval between aa and bb. Applying this with y=μστy=\mu_{\sigma\tau} and using that Alice and Bob are ϵ\epsilon-accurate,

𝔼[DG(μστa+b2)]𝔼[DG(μστa)+DG(μστb)]2ϵ.\mathbb{E}\left[D_{G}\left(\mu_{\sigma\tau}\parallel\frac{a+b}{2}\right)\right]\leq\mathbb{E}\left[D_{G}(\mu_{\sigma\tau}\parallel a)+D_{G}(\mu_{\sigma\tau}\parallel b)\right]\leq 2\epsilon. ∎
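The pointwise inequality used in the first step is also easy to verify numerically; here is a small Python sketch (our own illustration) checking it for the squared-distance and binary-KL Bregman divergences on random triples:

import numpy as np

rng = np.random.default_rng(0)

def sq(y, x):                       # Bregman divergence for G(x) = x^2
    return (y - x) ** 2

def kl(y, x):                       # Bregman divergence for the binary entropy G
    y = np.clip(y, 1e-9, 1 - 1e-9)  # guard the y*log(y) terms when y is 0 or 1
    return y * np.log(y / x) + (1 - y) * np.log((1 - y) / (1 - x))

y = rng.uniform(0, 1, size=100_000)
a = rng.uniform(0.01, 0.99, size=100_000)
b = rng.uniform(0.01, 0.99, size=100_000)
mid = (a + b) / 2

for D in (sq, kl):
    lhs = D(y, mid)                         # divergence to the midpoint
    rhs = np.maximum(D(y, a), D(y, b))      # max of divergences to the endpoints
    print(D.__name__, bool(np.all(lhs <= rhs + 1e-9)))   # expected: True for both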

Another natural choice for Alice’s and Bob’s consensus belief is the QA pool (see [NR21a]). Proposition D.4 likewise holds for the QA pool in place of the midpoint, and indeed holds for any choice of consensus belief that is guaranteed to lie in between Alice’s and Bob’s expectations. Thus, any such definition will be weaker than our definition of ϵ\epsilon-accuracy for Alice and Bob (up to a constant factor).

To summarize, among the above definitions of agreement, ϵ\epsilon-agreement is the weakest; and among the above definitions of accuracy, Alice’s and Bob’s ϵ\epsilon-accuracy is the strongest. This is an indication of strength for Theorem 4.18: it starts from a relatively weak premise and reaches a relatively strong conclusion.

Appendix E Implications for Communication Complexity

Our results can be framed in a communication complexity context, where they imply that “substitutable” functions can be computed with probability 1δ1-\delta (over the inputs) with a transcript length depending only on δ\delta. This is a nonstandard and weak notion of computing the function, but sketching the reduction may inspire future work on connections between substitutes and communication complexity.

In a classic deterministic communication complexity setup (e.g. [RY20]), Alice holds σ𝒮\sigma\in\mathcal{S}, Bob holds τ𝒯\tau\in\mathcal{T}, and the goal is to compute some function g:𝒮×𝒯{0,1}g:\mathcal{S}\times\mathcal{T}\to\{0,1\} using a communication protocol (see Section 2.2). Our setting captures this model when Y=g(σ,τ)Y=g(\sigma,\tau). Observe that in this case, Y=μστY=\mu_{\sigma\tau}, i.e. Alice and Bob’s information together determine YY completely. A communication protocol defines its output by a function h:Π{0,1}h:\Pi\to\{0,1\} where Π\Pi is the space of transcripts. We can simply let h(π)=round(μST)h(\pi)=\text{round}(\mu_{ST}), i.e. rounding the ex post expectation 𝔼[Yπ]=μST\mathbb{E}\left[Y\mid\pi\right]=\mu_{ST} to either zero or one. Here μST\mu_{ST} is the belief of “Charlie”, i.e. the estimate based on the common knowledge of Alice and Bob after the protocol is completed.

Definition E.1 (Rectangle substitutes, (1δ)(1-\delta)-computes).

Given a function gg and a distribution 𝒟\mathcal{D} over 𝒮×𝒯\mathcal{S}\times\mathcal{T}, we say (g,𝒟)(g,\mathcal{D}) satisfy rectangle substitutes if the corresponding information structure with Y=g(σ,τ)Y=g(\sigma,\tau) satisfies rectangle substitutes (Definition 2.6). We say a protocol (1δ)(1-\delta)-computes gg over 𝒟\mathcal{D} if, with probability at least 1δ1-\delta over (σ,τ)𝒟(\sigma,\tau)\sim\mathcal{D}, the protocol has h(π)=g(σ,τ)h(\pi)=g(\sigma,\tau).

By our results, under rectangle substitutes (g,𝒟)(g,\mathcal{D}), any agreement protocol approximately computes gg over 𝒟\mathcal{D}. More precisely, using a fast substitutes-agreement protocol similar to Proposition B.1, we obtain the following.

Corollary E.2.

Suppose (g,𝒟)(g,\mathcal{D}) satisfy rectangle substitutes. Then for every δ(0,1)\delta\in(0,1), there is a deterministic communication protocol using O(log(1/δ))O(\log(1/\delta)) bits of communication that (1δ)(1-\delta)-computes gg over 𝒟\mathcal{D}.

Proof.

In round one, Alice sends her current expectation μσ\mu_{\sigma} rounded to a multiple of ϵ\epsilon; call this message AA. In round two, Bob sends his updated expectation μSτ\mu_{S\tau} rounded to a multiple of ϵ\epsilon; call this message BB. The protocol then halts, and the output is BB rounded to either zero or one. It uses O(log(1/ϵ))O(\log(1/\epsilon)) bits. Let S,TS,T be the random rectangle associated with the protocol.

By construction, |μσA|ϵ|\mu_{\sigma}-A|\leq\epsilon, and μS\mu_{S} is the expectation of YY conditioned on AA, i.e. an average of the values μσ\mu_{\sigma^{\prime}} over signals σ\sigma^{\prime} that produce the same message AA; since all such values round to the same multiple of ϵ\epsilon and hence lie within ϵ\epsilon of one another, it follows that |μσμS|ϵ|\mu_{\sigma}-\mu_{S}|\leq\epsilon. Using substitutes (just as in Proposition B.1),

𝔼[(μστμSτ)2]𝔼[(μσμS)2]ϵ2.\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S\tau})^{2}\right]\leq\mathbb{E}\left[(\mu_{\sigma}-\mu_{S})^{2}\right]\leq\epsilon^{2}.

By construction, |BμSτ|ϵ|B-\mu_{S\tau}|\leq\epsilon. Therefore, by the 12\tfrac{1}{2}-approximate triangle inequality for squared distance (e.g. Proposition 4.13),

𝔼[(μστB)2]2𝔼[(μστμSτ)2]+2𝔼[(μSτB)2]4ϵ2.\mathbb{E}\left[(\mu_{\sigma\tau}-B)^{2}\right]~{}\leq~{}2\mathbb{E}\left[(\mu_{\sigma\tau}-\mu_{S\tau})^{2}\right]+2\mathbb{E}\left[(\mu_{S\tau}-B)^{2}\right]~{}\leq~{}4\epsilon^{2}.

Now, the protocol is incorrect only if |Bμστ|12|B-\mu_{\sigma\tau}|\geq\tfrac{1}{2}. Using Markov’s inequality,

Pr[|Bμστ|12]\displaystyle\Pr[|B-\mu_{\sigma\tau}|\geq\tfrac{1}{2}] =Pr[(Bμστ)214]\displaystyle=\Pr[(B-\mu_{\sigma\tau})^{2}\geq\tfrac{1}{4}]
4𝔼[(Bμστ)2]\displaystyle\leq 4\mathbb{E}\left[(B-\mu_{\sigma\tau})^{2}\right]
16ϵ2.\displaystyle\leq 16\epsilon^{2}.

Therefore, given δ(0,1)\delta\in(0,1), we run the protocol with ϵ=δ/16\epsilon=\sqrt{\delta/16}. The probability of an incorrect output is at most δ\delta, and we use O(log(1/ϵ))=O(log(1/δ))O(\log(1/\epsilon))=O(\log(1/\delta)) bits of communication. ∎
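To make the two-round protocol above concrete, the following Python sketch simulates it end to end on a toy threshold function. This is our own example; we have not verified that it satisfies rectangle substitutes, so it only illustrates the mechanics of the protocol, not the guarantee of Corollary E.2.

import numpy as np

# Toy inputs: sigma and tau are independent and uniform on {0,...,99}, and the
# target is the threshold function g(sigma, tau) = 1 if sigma + tau >= 100.
n = 100
g = (np.arange(n)[:, None] + np.arange(n)[None, :] >= n).astype(float)

eps = 0.1
mu_sigma = g.mean(axis=1)            # Alice's expectation E[Y | sigma]
A = np.round(mu_sigma / eps) * eps   # round one: Alice's message (nearest multiple of eps)

errors = 0
for s in range(n):
    S = np.isclose(A, A[s])          # Alice-signals consistent with the message A
    for t in range(n):
        mu_S_tau = g[S, t].mean()            # Bob's updated expectation E[Y | S, tau]
        B = np.round(mu_S_tau / eps) * eps   # round two: Bob's message
        output = 1.0 if B >= 0.5 else 0.0    # output: B rounded to {0, 1}
        errors += output != g[s, t]

print("error probability over uniform inputs:", errors / n**2)

Decreasing eps shrinks the rectangles, and with them the error probability, at the cost of more bits—mirroring the choice of ϵ\epsilon as a function of δ\delta in the proof above.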