
Continual Mean Estimation Under User-Level Privacy

Anand Jerry George* (École Polytechnique Fédérale de Lausanne), Lekshmi Ramesh (Indian Institute of Science, Bangalore), Aditya Vikram Singh (Indian Institute of Science, Bangalore), Himanshu Tyagi (Indian Institute of Science, Bangalore)
Abstract

We consider the problem of continually releasing an estimate of the population mean of a stream of samples that is user-level differentially private (DP). At each time instant, a user contributes a sample, and the users can arrive in arbitrary order. Until now, these requirements of continual release and user-level privacy were considered in isolation. But, in practice, both requirements come together, as users often contribute data repeatedly and multiple queries are made. We provide an algorithm that outputs a mean estimate at every time instant t such that the overall release is user-level \varepsilon-DP and has the following error guarantee: denoting by M_{t} the maximum number of samples contributed by a user, as long as \tilde{\Omega}(1/\varepsilon) users have M_{t}/2 samples each, the error at time t is \tilde{O}(1/\sqrt{t}+\sqrt{M_{t}}/t\varepsilon). This is a universal error guarantee, valid for all arrival patterns of the users. Furthermore, it (almost) matches the existing lower bounds for the single-release setting at all time instants when users have contributed equal numbers of samples.

1 Introduction

Aggregate queries over data sets were originally believed to maintain the privacy of data contributors. However, over the past two decades, several attacks have been proposed that manipulate the output of aggregate queries to extract information about an individual user’s data [NS08][SAW13][GAM19]. To address this, several mechanisms have been proposed that release a noisy output instead of the original query output. Remarkably, these mechanisms have been shown to preserve privacy under a mathematically rigorous privacy requirement called differential privacy [Dwo+06]. But, until recently, the analysis assumed that each user contributes a single data point, that the data set is static, and that only one query is made. Our goal in this paper is to address the more practical situation where multiple queries are made and the data set keeps getting updated between queries, using contributions from existing or new users. We provide an almost optimal private mechanism for this setting for the specific running-average query, which can easily be adapted to answer more general aggregate queries as well.

1.1 Problem formulation and some heuristics

Consider a stream (x_{1},\ldots,x_{T}) of T data points contributed by n users, where T\geq n. For simplicity, we assume that one point is contributed at each time: x_{t}\in\mathbb{R}^{d} is contributed by a user u_{t}\in[n] at time t. The maximum number of samples contributed by any user till time instant t is denoted by M_{t}, and we assume that M_{t}\leq m, i.e., each user contributes at most m samples.

We formulate our query release problem as a statistical estimation problem to identify an optimal mechanism. Specifically, we assume that each x_{t} is drawn independently from a distribution P on \mathbb{R}^{d} with unknown mean \mu. At each time step t\in[T], we are required to output an estimate \hat{\mu}_{t} for the mean of P, while guaranteeing that the sequence of outputs (\hat{\mu}_{1},\ldots,\hat{\mu}_{t}) is user-level \varepsilon-differentially private (\varepsilon-DP). Namely, the output is \varepsilon-DP with respect to the user input comprising all the data points contributed by a single user [Ami+19].

A naive application of standard differentially private mechanisms at each time step will lead to error rates with suboptimal dependence on the time window and the number of samples contributed by each user. For instance, consider the most basic setting where users draw samples independently from a Bernoulli distribution with parameter \mu. At any time t\in[T], a single user can affect the sample mean (1/t)\sum_{i=1}^{t}x_{i} by at most M_{t}/t. Adding Laplace noise with parameter M_{t}/(t\varepsilon) to this sample mean will therefore guarantee user-level \varepsilon-DP at each step (implying \varepsilon t-DP over t steps) and an error of O(1/\sqrt{t}+M_{t}/t\varepsilon). Rescaling the privacy parameter, we get an error that scales as O(1/\sqrt{t}+M_{t}/\varepsilon). While the statistical error term is optimal, the error term due to privacy is not: in fact, it does not improve as time progresses. One can do better by using mechanisms specific to the streaming setting, such as the binary mechanism [CSS10][Dwo+10], as we describe in more detail in Section 3. Indeed, we will see that the privacy error term can be improved in two respects: first, it can be made to decay as time progresses, and second, it only needs to grow sublinearly with M_{t}.
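
To make the naive baseline concrete, here is a minimal Python sketch of the per-step Laplace release described above (the function name and the interface are our own illustration, not the paper's code):

import numpy as np

rng = np.random.default_rng(0)

def naive_continual_mean(x, M, eps):
    # x:   stream of samples, each in [0, 1]
    # M:   M[t-1] = max number of samples any single user has contributed up to time t
    # eps: per-step privacy parameter
    estimates, running_sum = [], 0.0
    for t, xt in enumerate(x, start=1):
        running_sum += xt
        # one user can move the sample mean by at most M_t/t, so adding
        # Lap(M_t/(t*eps)) noise makes each individual release eps-DP
        estimates.append(running_sum / t + rng.laplace(scale=M[t - 1] / (t * eps)))
    return estimates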

However, a key challenge still remains. Each user contributes multiple samples to the stream, and different samples from the same user can arrive at arbitrary time instants. The output must depend on the arrival pattern of the users. For instance, when all samples in the stream are contributed by a single user, we cannot release much information. Indeed, changing this user’s data can potentially change any query response by a large amount, leading to increased sensitivity and the addition of large amounts of noise to guarantee user-level privacy. A better strategy is to withhold answers for a while until a new user arrives; this provides a sort of diversity advantage and reduces the amount of noise we need to add. The process of withholding alone does not, however, lead to an optimal error rate. We additionally need to control sensitivity by forming local averages of each user’s samples and then truncating these averages, as is done in the one-shot setting [Lev+21], before adding noise.

1.2 Summary of contributions

We present a modular and easy-to-implement algorithm for continual mean estimation under user-level privacy constraints. Our main algorithm has two key components: a sequence of binary mechanisms that keep track of truncated averages, and a withhold-release mechanism that decides when to update the mean estimate. The role of the binary mechanisms is similar to that in [CSS10][DRV10]: to minimize the number of outputs each data point influences. The withhold-release mechanism releases a query only when there is “sufficient diversity,” i.e., enough users have contributed to the data. In fact, in a practical implementation of our algorithm, we will need to maintain this diversity by omitting an excessive number of samples from a few users in the mean estimate evaluation. Together, these components allow us to balance the need for more data for accuracy against the need to control the sensitivity due to a single user’s data.

The resulting performance is characterized roughly as follows; see Section 3 for the formal statement.

Main Result (Informal Version).

Our algorithm provides a user-level \varepsilon-DP mechanism to output a mean estimate at every time instant t when the data has “sufficient diversity,” with error \tilde{O}(1/\sqrt{t}+\sqrt{M_{t}}/t\varepsilon), where M_{t} is the maximum number of samples contributed by any user till time step t.

We do not make any assumptions on the order in which the data points arrive from different users. The “sufficient diversity” condition in our theorem (formalized in Definition 3.3) codifies the necessity of having a sufficient number of users with a sufficient number of samples for our algorithm to achieve good accuracy while ensuring user-level privacy. Section 3 further elaborates on the difficulty of obtaining good accuracy under arbitrary user orderings and on how our exponential withhold-release mechanism helps overcome it.

1.3 Prior Work

Continually releasing query answers can leak more information compared to the single-release setting, since each data point can now be involved in multiple query answers. In addition, each user can contribute multiple data points, in which case we would like to ensure that the output of any privacy-preserving mechanism we employ remains roughly the same even when all of the data points contributed by a user are changed. There have been a few foundational works on both these fronts [Dwo+10][Jai+21][Lev+21], but a unified treatment is lacking. Indeed, addressing the problem of user-level privacy in the streaming setting was noted as an interesting open problem in [Nik13].

Privately answering count queries in the continual release setting was first addressed in [Dwo+10][CSS10], where the binary mechanism was introduced. This mechanism releases an estimate for the number of ones in a bit stream by using a small number of noisy partial sums. Compared to the naive scheme of adding noise at every time step, which leads to an error that grows linearly with T, the binary tree mechanism has error that depends only logarithmically on T. Following this, other works have explored (item-level) private continual release in other settings [Cha+12][Bol+13][DKY17][Jos+18][PAK19][Jai+21].

Releasing statistics under user-level privacy constraints was discussed in [Dwo+10][Dwo+10a], and has recently begun to gain attention [Lev+21][Cum+21][NME22]. In particular, [Lev+21] considers user-level privacy for one-shot d-dimensional mean estimation and shows that a truncation-based estimator achieves error that scales as \tilde{O}(\sqrt{d/mn}+\sqrt{d}/\sqrt{m}n\varepsilon), provided there are at least \tilde{O}(\sqrt{d}/\varepsilon) users. This requirement on the number of users was later improved to \tilde{O}(1/\varepsilon) in [NME22]. [Cum+21] takes into account heterogeneity of user distributions for mean estimation under user-level privacy constraints.

1.4 Organization

Section 2 describes the problem setup and gives a brief recap of user-level privacy and the binary mechanism. We describe our algorithm in Section 3 and give a rough sketch of its privacy and utility guarantees. Section 4 includes a discussion on the extension to d-dimensional mean estimation and on handling other distribution families. Finally, in Section 5 we discuss the optimality of our algorithm, and we end with a discussion of future directions in Section 6.

2 Preliminaries

2.1 Problem setup

We observe an input stream of the form ((x_{1},u_{1}),(x_{2},u_{2}),\ldots,(x_{T},u_{T})), where x_{t}\in\mathcal{X} is the sample and u_{t}\in[n] is the user contributing the sample x_{t}. The samples (x_{t})_{t\in[T]} are drawn independently from a distribution with unknown mean. The goal is to output an estimate \hat{\mu}_{t} of the mean for every t\in[T], such that the overall output (\hat{\mu}_{1},\ldots,\hat{\mu}_{T}) is user-level \varepsilon-DP. We present our main theorems for the case when each x_{t} is a Bernoulli random variable with unknown mean \mu. Extensions to other distribution families are discussed in Section 4.

2.2 Differential privacy

Let \sigma=\big((x_{t},u_{t})\big)_{t\in[T]} and \sigma^{\prime}=\big((x_{t}^{\prime},u_{t}^{\prime})\big)_{t\in[T]} denote two streams of inputs. We say that \sigma and \sigma^{\prime} are user-level neighbors if there exists j\in[n] such that x_{t}=x_{t}^{\prime} for every t\in[T] satisfying u_{t}\neq j. We now define the notion of a user-level \varepsilon-DP algorithm.

Definition 2.1.

An algorithm \mathcal{A}:(\mathcal{X}\times[n])^{T}\rightarrow\mathcal{Y} is said to be user-level \varepsilon-DP if for every pair of streams \sigma,\sigma^{\prime} that are user-level neighbors and every subset Y\subseteq\mathcal{Y}, \mathbb{P}(\mathcal{A}(\sigma)\in Y)\leq e^{\varepsilon}\,\mathbb{P}(\mathcal{A}(\sigma^{\prime})\in Y).

We will be using the following composition result satisfied by DP mechanisms.

Lemma 2.2.

Let \mathcal{M}_{i}:(\mathcal{X}\times[n])^{T}\rightarrow\mathcal{Y} be user-level \varepsilon_{i}-DP mechanisms for i\in[k]. Then the composition \mathcal{M}:(\mathcal{X}\times[n])^{T}\rightarrow\mathcal{Y}^{k} of these mechanisms, given by \mathcal{M}(x):=(\mathcal{M}_{1}(x),\ldots,\mathcal{M}_{k}(x)), is user-level (\sum_{i=1}^{k}\varepsilon_{i})-DP.

We now define the Laplace mechanism, a basic privacy primitive we employ. We use {\rm Lap}(b) to denote the Laplace distribution with parameter b.

Definition 2.3.

For a function f:\mathcal{X}^{n}\rightarrow\mathbb{R}^{k}, the Laplace mechanism with parameter \varepsilon is a randomized algorithm \mathcal{M}:\mathcal{X}^{n}\rightarrow\mathbb{R}^{k} with \mathcal{M}(x)=f(x)+(Z_{1},\ldots,Z_{k}), where Z_{1},\ldots,Z_{k}\overset{\rm i.i.d.}{\sim}{\rm Lap}(\Delta f/\varepsilon), \Delta f:=\max_{x,x^{\prime}}\left\lVert f(x)-f(x^{\prime})\right\rVert_{1} with the maximum over neighboring x,x^{\prime}\in\mathcal{X}^{n} (differing in one entry), and addition is coordinate-wise.

The following lemma states the privacy-utility guarantee of the Laplace mechanism.

Lemma 2.4.

The Laplace mechanism is \varepsilon-DP and guarantees

\mathbb{P}\Big(\left\lVert f(x)-\mathcal{M}(x)\right\rVert_{\infty}\geq(\Delta f/\varepsilon)\ln(k/\delta)\Big)\leq\delta\quad\forall\,\delta\in(0,1].
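
For concreteness, here is a minimal Python sketch of the Laplace mechanism of Definition 2.3 (the example query and sensitivity value are our own illustration):

import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(f, sensitivity, eps, x):
    # release f(x) plus independent Lap(sensitivity/eps) noise per coordinate
    fx = np.atleast_1d(np.asarray(f(x), dtype=float))
    return fx + rng.laplace(scale=sensitivity / eps, size=fx.shape)

# Example: a count over bits; changing one record moves the count by at most 1.
noisy_count = laplace_mechanism(np.sum, sensitivity=1.0, eps=0.5, x=np.array([0, 1, 1, 0, 1]))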

2.3 Binary mechanism

The binary mechanism (Algorithm 1) [CSS10] receives as input a stream (x_{1},x_{2},\ldots,x_{T}) and outputs a noisy sum S_{t} of the stream up to time t at every t\in[T], such that the overall output (S_{1},S_{2},\ldots,S_{T}) is \varepsilon-DP. Furthermore, for any t\in[T], the error between the output and the running sum satisfies \left\lvert S_{t}-\sum_{i=1}^{t}x_{i}\right\rvert=O\left(\frac{\Delta}{\varepsilon}\log T\sqrt{\log t}\ln\frac{1}{\delta}\right) with probability at least 1-\delta. Here, \Delta is the magnitude of the maximum possible variation of an element in the stream; e.g., if 1\leq x_{i}\leq 3 for every i\in[T], then \Delta=2.

We observe two important properties of the binary mechanism (Algorithm 1):

  • (i)

    Any stream element x_{i} is involved in computing at most (1+\log T) terms of the array {\rm NoisyPartialSums}. For example, x_{1} is needed only while computing {\rm NoisyPartialSums}[t] for t=1,2,4,8, and so on.

  • (ii)

    For any t, the output S_{t} is a sum of at most (1+\log t) terms from the array {\rm NoisyPartialSums}.

We now describe how these properties of the binary mechanism lead to privacy and utility (accuracy) guarantees.

Privacy. Since, for every t, the output S_{t} is a function of {\rm NoisyPartialSums}, it suffices to ensure that the array {\rm NoisyPartialSums} is \varepsilon-DP. From (i) above, changing a stream element will change at most (1+\log T) terms of the array {\rm NoisyPartialSums}. Moreover, since the maximum variation of a stream element is \Delta, each of these terms of {\rm NoisyPartialSums} will change by at most \Delta. Overall, changing an arbitrary stream element x_{i} changes the \ell_{1}-norm of the array {\rm NoisyPartialSums} by at most \Delta(1+\log T). Thus, to ensure that the overall stream of outputs from the binary mechanism is \varepsilon-DP, it suffices to add {\rm Lap}(\eta) noise to each term of {\rm NoisyPartialSums}, where \eta=\Delta(1+\log T)/\varepsilon.

Utility. From (ii) above, since the output S_{t} is a sum of at most (1+\log t) terms from the array {\rm NoisyPartialSums}, where each term of {\rm NoisyPartialSums} carries independent {\rm Lap}(\eta) noise, we have that, with probability at least 1-\delta, \left\lvert S_{t}-\sum_{i=1}^{t}x_{i}\right\rvert\leq\eta\sqrt{1+\log t}\ln\frac{1}{\delta}.

Algorithm 1 Binary Mechanism [CSS10]
1:(x_{t})_{t\in[T]} (stream), T (stream length), \Delta (max variation of a stream element), \varepsilon (privacy parameter)
2:Initialize {\rm Stream}, {\rm NoisyPartialSums} (arrays of length T)
3:\eta\leftarrow\Delta(1+\log T)/\varepsilon \triangleright noise parameter
4:for t=1,2,\ldots,T do
5:     {\rm Stream}[t]\leftarrow x_{t}
6:     Express t in binary form: t=\sum_{j=0}^{\lfloor\log t\rfloor}(b_{j}\cdot 2^{j}) \triangleright b_{j}\in\left\{0,1\right\}
7:     j^{\ast}\leftarrow\min\left\{j:b_{j}\neq 0\right\} \triangleright e.g., j^{\ast}=0 if t is odd
8:     {\rm NoisyPartialSums}[t]\leftarrow\left(\sum_{i=t-2^{j^{\ast}}+1}^{t}{\rm Stream}[i]\right)+{\rm Lap}(\eta) \triangleright noisy sum of the latest 2^{j^{\ast}} elements of {\rm Stream}
9:     {\rm Sum}\leftarrow 0, {\rm index}\leftarrow 0
10:     for j=\lfloor\log t\rfloor,\lfloor\log t\rfloor-1,\ldots,0 do
11:         {\rm index}\leftarrow{\rm index}+(b_{j}\cdot 2^{j})
12:         {\rm Sum}\leftarrow{\rm Sum}+b_{j}\cdot{\rm NoisyPartialSums}[{\rm index}] \triangleright add a partial sum only when b_{j}=1
13:     end for
14:     return S_{t}={\rm Sum}
15:end for

In our algorithms, we invoke an instance of the binary mechanism as {\rm BinMech}, and abstract out its functionality using the following three ingredients (a minimal code sketch of this interface follows the list):

  • {\rm BinMech.Stream}: This is an array which acts as the input stream for the binary mechanism {\rm BinMech}. In our algorithms, we will feed an element to {\rm BinMech.Stream} only at certain special time instances.

  • {\rm BinMech.NoisyPartialSums}: This array is of the same length as {\rm BinMech.Stream}. The k-th term of {\rm BinMech.NoisyPartialSums} is computed after the k-th element enters {\rm BinMech.Stream}. The scale of the Laplace noise added while computing the terms of {\rm BinMech.NoisyPartialSums} (the same for all terms) is passed as a noise parameter when invoking {\rm BinMech}. The sum output by {\rm BinMech} is formed by combining terms from {\rm BinMech.NoisyPartialSums}.

  • {\rm BinMech.Sum}: Suppose, at time t, {\rm BinMech.Stream} (and, thus, {\rm BinMech.NoisyPartialSums}) has k elements. Then, {\rm BinMech.Sum} is a function which, when invoked at time t, outputs S_{k} (a noisy sum of all k elements in {\rm BinMech.Stream}) by combining terms stored in {\rm BinMech.NoisyPartialSums}.
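
The following Python sketch (our own illustration, assuming the noise level \eta is supplied by the caller) shows one way to realize this interface; it is not the authors' code:

import numpy as np

rng = np.random.default_rng(0)

class BinMech:
    def __init__(self, eta):
        self.eta = eta                   # Laplace scale for every noisy partial sum
        self.stream = []                 # BinMech.Stream
        self.noisy_partial_sums = []     # BinMech.NoisyPartialSums

    def feed(self, x):
        # append x and record the noisy sum of the latest 2**j_star elements,
        # where 2**j_star is the lowest set bit of the current length t
        self.stream.append(x)
        t = len(self.stream)
        j_star = (t & -t).bit_length() - 1
        block = sum(self.stream[t - 2 ** j_star : t])
        self.noisy_partial_sums.append(block + rng.laplace(scale=self.eta))

    def sum(self):
        # BinMech.Sum: combine at most 1 + log2(t) noisy partial sums
        t, total, index = len(self.stream), 0.0, 0
        for j in range(t.bit_length() - 1, -1, -1):
            if (t >> j) & 1:
                index += 2 ** j
                total += self.noisy_partial_sums[index - 1]
        return total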

3 Algorithm

In this section, we build up to our main algorithm (Algorithm 6), pointing out the main ideas along the way.

A naive use of the binary mechanism.

Given a stream \big((x_{t},u_{t})\big)_{t\in[T]} (where x_{t}\overset{\rm i.i.d.}{\sim}\mathrm{Ber}(\mu)), Algorithm 2 presents a straightforward way to estimate the mean \mu in the continual-release user-level DP setting. The algorithm feeds x_{t} to the binary mechanism {\rm BinMech} at every time instant t. Since each user contributes at most m samples, changing a user will change at most m(1+\log T) terms of {\rm BinMech.NoisyPartialSums}. Moreover, since the maximum variation of any stream element x_{t} is 1, changing a user will change the \ell_{1}-norm of {\rm BinMech.NoisyPartialSums} by at most m(1+\log T). Thus, for \eta=\frac{m(1+\log T)}{\varepsilon}, adding {\rm Lap}(\eta) noise to each term of {\rm BinMech.NoisyPartialSums} ensures that this algorithm is user-level \varepsilon-DP. To obtain an accuracy guarantee at any given time t, we observe the following: since x_{1},\ldots,x_{t} are independent \mathrm{Ber}(\mu) samples, we have that, with probability at least 1-\delta, \left\lvert\sum_{i=1}^{t}x_{i}-t\mu\right\rvert\leq\sqrt{\frac{t}{2}\ln\frac{2}{\delta}}. Moreover, since S_{t} is a sum of at most (1+\log t) terms from {\rm BinMech.NoisyPartialSums}, where each term of {\rm BinMech.NoisyPartialSums} carries independent {\rm Lap}(\eta) noise, we have that, with probability at least 1-\delta, \left\lvert S_{t}-\sum_{i=1}^{t}x_{i}\right\rvert\leq\eta\sqrt{1+\log t}\ln\frac{1}{\delta}. Thus, using the union bound, we have with probability at least 1-2\delta that

\left\lvert\hat{\mu}_{t}-\mu\right\rvert=O\left(\sqrt{\frac{1}{t}\ln\frac{1}{\delta}}+\frac{m}{t\varepsilon}\sqrt{\log t}\log T\ln\frac{1}{\delta}\right). (1)

The first term on the right-hand side of (1) is the usual statistical error that results from using t samples to estimate \mu. The second term in the error (the “privacy error”), however, results from ensuring that the stream of estimates (\hat{\mu}_{1},\ldots,\hat{\mu}_{T}) is user-level \varepsilon-DP; note that this term is linear in m. We next describe a “wishful” scenario where the privacy error can be made proportional to \sqrt{m}. We will then see how to obtain such a result in the general scenario, which is our main contribution.

Algorithm 2 Continual mean estimation (naive)
1:\big((x_{t},u_{t})\big)_{t\in[T]} (stream), T (stream length), m (max no. of samples per user), \varepsilon (privacy parameter), \delta (failure probability).
2:Initialize binary mechanism {\rm BinMech} with noise level \eta=\frac{m(1+\log T)}{\varepsilon}.
3:for t=1,2,\ldots,T do
4:     {\rm BinMech.Stream}\leftarrow{\rm BinMech.Stream}\cup\left\{x_{t}\right\}
5:     S_{t}\leftarrow{\rm BinMech.Sum}
6:     return \hat{\mu}_{t}=\frac{S_{t}}{t}
7:end for

A better algorithm in a wishful scenario: Exploiting concentration.

Naive use of the binary mechanism for continual mean estimation fails to exploit the concentration phenomenon that results from each user contributing multiple i.i.d. samples to the stream. To see how concentration might help, consider the following scenario. Suppose that (somehow) we already have a prior estimate \tilde{\mu} that satisfies \left\lvert\tilde{\mu}-\mu\right\rvert\leq\frac{1}{\sqrt{m}}. Also, assume a user ordering where every user contributes their m samples in contiguous time steps. That is, user 1 contributes samples for the first m time steps, followed by user 2, who contributes samples for the next m time steps, and so on. In this case, Algorithm 3 presents a way to exploit the concentration phenomenon. Even though this algorithm outputs \hat{\mu}_{t} at every t, it only updates \hat{\mu}_{t} at t=m,2m,3m,\ldots,nm. Upon receiving a sample from a user, the algorithm does not immediately add it to {\rm BinMech.Stream}. Instead, the algorithm waits for a user to contribute all their m samples. It then computes the sum of those m samples and projects the sum onto the interval

\mathcal{I}=\left[m\tilde{\mu}-\Delta,\,m\tilde{\mu}+\Delta\right]\text{ where }\Delta=\sqrt{\frac{m}{2}\ln\frac{2n}{\delta}}+\sqrt{m}, (2)

before feeding the projected sum to {\rm BinMech.Stream}. By concentration of sums of i.i.d. Bernoulli random variables, and the fact that \left\lvert\tilde{\mu}-\mu\right\rvert\leq\frac{1}{\sqrt{m}} (by assumption), we have with probability at least 1-\delta that the sum of the m samples corresponding to every user falls inside the interval \mathcal{I}; thus, the projection operator \Pi(\cdot) has no effect with high probability.

Since there are n users, at most n elements are added to {\rm BinMech.Stream} throughout the course of the algorithm. Now, since there can be at most n elements in {\rm BinMech.Stream}, a given element in {\rm BinMech.Stream} will be used at most 1+\log n times while computing {\rm BinMech.NoisyPartialSums} (see Section 2.3). So, changing a user can change the \ell_{1}-norm of {\rm BinMech.NoisyPartialSums} by at most (2\Delta)(1+\log n), where \Delta is as in (2). Thus, to ensure that the algorithm is \varepsilon-DP (given initial estimate \tilde{\mu}), it suffices to add independent {\rm Lap}(\eta) noise while computing each term in {\rm BinMech.NoisyPartialSums}, where

\eta=\frac{2\Delta(1+\log n)}{\varepsilon}=\frac{2\left(\sqrt{\frac{m}{2}\ln\frac{2n}{\delta}}+\sqrt{m}\right)(1+\log n)}{\varepsilon}. (3)

Note that \eta defined above is proportional to \sqrt{m}, whereas the \eta defined in the naive use of the binary mechanism was proportional to m. Thus (details in Appendix B), one obtains a privacy error of \tilde{O}(\sqrt{m}/t\varepsilon), while still having a statistical error of O(1/\sqrt{t}) at every t.

Algorithm 3 Continual mean estimation (wishful scenario)
1:\big((x_{t},u_{t})\big)_{t\in[T]} (stream with the m samples from any user arriving contiguously), n (no. of users), m (no. of samples per user), \varepsilon (privacy), \tilde{\mu} (estimate of \mu s.t. \left\lvert\tilde{\mu}-\mu\right\rvert\leq 1/\sqrt{m}), \delta (failure probability).
2:Let \Pi(\cdot) be the projection onto the interval \mathcal{I}, where
\mathcal{I}=\left[m\tilde{\mu}-\Delta,\,m\tilde{\mu}+\Delta\right]\text{ where }\Delta=\sqrt{\frac{m}{2}\ln\frac{2n}{\delta}}+\sqrt{m}.
3:Initialize binary mechanism {\rm BinMech} with noise level \eta, where
\eta=\frac{2\Delta(1+\log n)}{\varepsilon}.
4:Initialize {\rm total}\leftarrow 0.
5:for t<m do
6:     return \hat{\mu}_{t}=\tilde{\mu}
7:end for
8:for t\geq m do
9:     if t\in\left\{m,2m,3m,\ldots,nm\right\} then
10:         {\rm total}\leftarrow{\rm total}+m
11:         \sigma=\Pi\left(\sum_{j=t-m+1}^{t}x_{j}\right)
12:         {\rm BinMech.Stream}\leftarrow{\rm BinMech.Stream}\cup\left\{\sigma\right\}
13:     end if
14:     S_{t}\leftarrow{\rm BinMech.Sum}
15:     return \hat{\mu}_{t}=\frac{S_{t}}{\rm total}
16:end for

The scenario here is wishful for two reasons: (i) we assume a prior estimate \tilde{\mu}, which we used to form the interval (2); (ii) we assume a special user ordering where every user contributes their samples contiguously. This user ordering ensures that \hat{\mu}_{t} is updated after every m time steps, and thus, for every t, at least t/2 samples are used in computing \hat{\mu}_{t}; this is the reason that the statistical error remains O(1/\sqrt{t}). Note that this algorithm can perform very poorly under a general user ordering. For instance, if the users contribute samples in a round-robin fashion (where a sample from user 1 is followed by a sample from user 2 and so on), then the algorithm would have to wait until time t=n(m-1) to obtain m samples from any given user.

Although wishful, there are two main ideas that we take away from this discussion. One is the idea of withholding: it is not necessary to incorporate information about every sample received till time t to compute \hat{\mu}_{t}. The second is the idea of truncation: the worst-case variation in the quantity of interest due to a change in a user’s data can be reduced by projecting the user’s contributions onto a smaller interval. This idea of exploiting concentration using truncation is also used in [Lev+21] for mean estimation in the single-release setting.

Designing algorithms for worst-case user order: Exponential withhold-release pattern.

The algorithm discussed above suffered in the worst-case user ordering since samples from a user were “withheld” until every sample from that user was obtained. This was done to exploit (using “truncation”) the concentration of the sum of m i.i.d. samples from the user. To exploit the concentration of sums of i.i.d. samples in the setting where the user order can be arbitrary, we propose the idea of withholding the samples (and applying truncation) at exponentially increasing intervals. Namely, for a given user u, we do not withhold the first two samples x_{1}^{(u)},x_{2}^{(u)}; then, we withhold x_{3}^{(u)} and release a truncated version of (x_{3}^{(u)}+x_{4}^{(u)}) when we receive x_{4}^{(u)}; we then withhold x_{5}^{(u)},x_{6}^{(u)},x_{7}^{(u)} and release a truncated version of (x_{5}^{(u)}+x_{6}^{(u)}+x_{7}^{(u)}+x_{8}^{(u)}) when we receive x_{8}^{(u)}; and so on. In general, we withhold samples x_{2^{\ell-1}+1}^{(u)},\ldots,x_{2^{\ell}-1}^{(u)} and release a truncated version of \sum_{i=2^{\ell-1}+1}^{2^{\ell}}x_{i}^{(u)} when we receive the 2^{\ell}-th sample x_{2^{\ell}}^{(u)} from user u.
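
The resulting per-user release schedule is easy to state in code. The following Python sketch (our own illustration) lists, for a user contributing samples 1,\ldots,m, which arriving sample triggers a release and which block of sample indices that release covers:

def release_blocks(m):
    # returns (trigger, block) pairs: when sample number `trigger` (a power
    # of two) arrives, the (truncated) sum over indices in `block` is released
    blocks, p = [(1, [1])], 2
    while p <= m:
        blocks.append((p, list(range(p // 2 + 1, p + 1))))
        p *= 2
    return blocks

# release_blocks(8) -> [(1, [1]), (2, [2]), (4, [3, 4]), (8, [5, 6, 7, 8])]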

We now present Algorithm 4, which uses this exponential withhold-release idea and, assuming a prior \tilde{\mu}, outputs an estimate with statistical error O(1/\sqrt{t}) and privacy error \tilde{O}(\sqrt{m}/t\varepsilon) for arbitrary user orderings.

Algorithm 4 Continual mean estimation assuming prior (single binary mechanism)
1:\big((x_{t},u_{t})\big)_{t\in[T]} (stream), n (max no. of users), m (max no. of samples per user), \varepsilon (privacy parameter), \tilde{\mu} (estimate of \mu satisfying \left\lvert\tilde{\mu}-\mu\right\rvert\leq 1/\sqrt{m}), \delta (failure probability).
2:- Let x^{(u)}_{j} denote the j-th sample contributed by user u.
3:- For \ell\geq 1, let \Pi_{\ell}(\cdot) be the projection onto the interval \mathcal{I}_{\ell}, where
\mathcal{I}_{\ell}:=\left[2^{\ell-1}\tilde{\mu}-\Delta_{\ell},\,2^{\ell-1}\tilde{\mu}+\Delta_{\ell}\right],\quad\Delta_{\ell}=\sqrt{\frac{2^{\ell-1}}{2}\ln\frac{2n\log m}{\delta}}+\frac{2^{\ell-1}}{\sqrt{m}}.
4:Initialize binary mechanism {\rm BinMech} with noise level \eta(m,n,\delta), where
\eta(m,n,\delta)=\frac{2\Delta(1+\log m)\log(1+n(1+\log m))}{\varepsilon},\quad\Delta=\sqrt{\frac{m}{2}\ln\frac{2n\log m}{\delta}}+\sqrt{m}.
5:For each user u, let M(u)= no. of times user u has contributed a sample; initialize M(u)\leftarrow 0.
6:Initialize {\rm total}\leftarrow 0
7:for t=1,2,\ldots,T do
8:     M(u_{t})\leftarrow M(u_{t})+1
9:     if \log M(u_{t})\in\mathbb{Z}_{+} then
10:         \ell\leftarrow\log M(u_{t}) \triangleright i.e., M(u_{t})=2^{\ell}
11:         {\rm total}\leftarrow{\rm total}+M(u_{t})
12:         if \ell=0 then \sigma\leftarrow x_{t}
13:         else if \ell\in\left\{1,2,3,\ldots\right\} then
14:              \sigma\leftarrow\Pi_{\ell}\left(\sum_{j=2^{\ell-1}+1}^{2^{\ell}}x^{(u_{t})}_{j}\right)
15:         end if
16:         {\rm BinMech.Stream}\leftarrow{\rm BinMech.Stream}\cup\left\{\sigma\right\}
17:     end if
18:     S_{t}\leftarrow{\rm BinMech.Sum}
19:     return \hat{\mu}_{t}=\frac{S_{t}}{\rm total}
20:end for

Continual mean estimation assuming prior estimate: Algorithm 4.

Let x^{(u)}_{j} be the j-th sample obtained from user u. Let \Pi_{\ell}(\cdot) be the projection onto the interval \mathcal{I}_{\ell}, where

\mathcal{I}_{\ell}:=\left[2^{\ell-1}\tilde{\mu}-\Delta_{\ell},\,2^{\ell-1}\tilde{\mu}+\Delta_{\ell}\right],\quad\Delta_{\ell}=\sqrt{\frac{2^{\ell-1}}{2}\ln\frac{2n\log m}{\delta}}+\frac{2^{\ell-1}}{\sqrt{m}}. (4)

Let u_{t} be the user at time t. Upon receiving a sample from a user, the algorithm does not immediately add it to {\rm BinMech.Stream}. Instead, the algorithm only adds anything new to {\rm BinMech.Stream} at time instances t when the total number of samples obtained from user u_{t} becomes 2^{\ell}, for \ell\in\left\{0,1,2,\ldots\right\}. At such a time, the algorithm adds \Pi_{\ell}\left(\sum_{j=2^{\ell-1}+1}^{2^{\ell}}x^{(u_{t})}_{j}\right) to {\rm BinMech.Stream}. (The summation inside \Pi_{\ell} makes sense only for \ell\geq 1; for \ell=0, which corresponds to the first sample from user u_{t}, the algorithm simply adds the sample to {\rm BinMech.Stream}.) This has the effect that, corresponding to a user, there are at most (1+\log m) elements in {\rm BinMech.Stream}. Since there are at most n users, at most n(1+\log m) elements are added to {\rm BinMech.Stream} throughout the course of the algorithm.

Now, since there can be at most n(1+\log m) elements in {\rm BinMech.Stream}, a given element in {\rm BinMech.Stream} will be used at most \log(n(1+\log m)) times while computing {\rm BinMech.NoisyPartialSums} (see Section 2.3). Thus, changing a user can change the \ell_{1}-norm of {\rm BinMech.NoisyPartialSums} by at most (1+\log m)(\log(n(1+\log m)))\Delta, where \Delta is the maximum sensitivity of an element contributed by a user to {\rm BinMech.Stream}. As can be seen from (4), we have \Delta=2\left(\sqrt{\frac{m}{2}\ln\frac{2n\log m}{\delta}}+\sqrt{m}\right). Thus, to ensure that Algorithm 4 is \varepsilon-DP (given initial estimate \tilde{\mu}), it suffices to add independent {\rm Lap}(\eta(m,n,\delta)) noise while computing each term in {\rm BinMech.NoisyPartialSums}, where

\eta(m,n,\delta)=\frac{2\Delta(1+\log m)\log(1+n(1+\log m))}{\varepsilon},\quad\Delta=\sqrt{\frac{m}{2}\ln\frac{2n\log m}{\delta}}+\sqrt{m}. (5)

Note that, as in (3), the magnitude of the noise in (5) is \tilde{O}(\sqrt{m}). Moreover, with the exponential withhold-release pattern, we are guaranteed that at any time t, the estimate \hat{\mu}_{t} is computed using at least t/2 samples, no matter what the user ordering is (Claim C.1 in Appendix C). This gives us the following guarantee (proof in Appendix C.2):

Theorem 3.1.

Assume that we are given a user-level \varepsilon-DP prior estimate \tilde{\mu} of the true mean \mu, such that \left\lvert\tilde{\mu}-\mu\right\rvert\leq\frac{1}{\sqrt{m}}. Then, Algorithm 4 is 2\varepsilon-DP. Moreover, for a given t\in[T], we have with probability at least 1-2\delta,

\left\lvert\hat{\mu}_{t}-\mu\right\rvert=\tilde{O}\left(\frac{1}{\sqrt{t}}+\frac{\sqrt{m}}{t\varepsilon}\right).

We now present Algorithm 5, which, in conjunction with exponential withhold-release, uses multiple binary mechanisms. Assuming a prior estimate \tilde{\mu}, this algorithm outputs an estimate with statistical error O(1/\sqrt{t}) and privacy error \tilde{O}(\sqrt{M_{t}}/t\varepsilon), where M_{t} is the maximum number of samples contributed by a user till time t. Note that M_{t} can be much smaller than m for large values of m.

Continual mean estimation assuming prior estimate: Algorithm 5.

In this algorithm, we use L+1 binary mechanisms {\rm BinMech}[0],\ldots,{\rm BinMech}[L], where L:=\lceil\log m\rceil. As was the case with Algorithm 4, here too the algorithm only adds anything new to a binary mechanism at time instances t when the total number of samples obtained from user u_{t} becomes 2^{\ell}, for \ell\in\left\{0,1,2,\ldots\right\}. What differentiates Algorithm 5 from Algorithm 4 is that we use different binary mechanisms for different values of \ell. Namely, the element \Pi_{\ell}\left(\sum_{j=2^{\ell-1}+1}^{2^{\ell}}x^{(u)}_{j}\right) from a user u is added to {\rm BinMech}[\ell].{\rm Stream}. This ensures that each element in {\rm BinMech}[\ell].{\rm Stream} has maximum sensitivity 2\Delta_{\ell}, where (from (4)) \Delta_{\ell}=\sqrt{\frac{2^{\ell-1}}{2}\ln\frac{2n\log m}{\delta}}+\frac{2^{\ell-1}}{\sqrt{m}}.
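
To see concretely how the noise adapts to the levels, the following Python sketch (our own transcription of the scale in (7) below, taking the tree-depth logarithms to base 2 as an assumption) computes the per-level Laplace parameter:

import math

def eta_level(m, n, level, eps, delta):
    # Laplace scale for BinMech[level]: each element of BinMech[level].Stream
    # is a (truncated) sum of 2**(level-1) samples, so its sensitivity, and
    # hence the noise, grows like sqrt(2**level)
    L = math.ceil(math.log2(m))
    block = 2.0 ** (level - 1)
    delta_l = math.sqrt(block / 2 * math.log(2 * n * math.log2(m) / delta)) + math.sqrt(block)
    return 2 * delta_l * (1 + math.log2(n)) * (L + 1) / eps

# Only levels up to ceil(log2(M_t)) are in use at time t, so the largest
# active noise scale is proportional to sqrt(M_t).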

Algorithm 5 Continual mean estimation assuming prior (multiple binary mechanisms)
1:\big((x_{t},u_{t})\big)_{t\in[T]} (stream), n (max no. of users), m (max no. of samples per user), \varepsilon (privacy parameter), \tilde{\mu} (estimate of \mu satisfying \left\lvert\tilde{\mu}-\mu\right\rvert\leq 1/\sqrt{m}), \delta (failure probability).
2:- Let x^{(u)}_{j} denote the j-th sample contributed by user u.
3:- For \ell\in\mathbb{Z}_{+}, let \Pi_{\ell}(\cdot) be the projection onto the interval \mathcal{I}_{\ell}, where
\mathcal{I}_{\ell}:=\left[2^{\ell-1}\tilde{\mu}-\Delta_{\ell},\,2^{\ell-1}\tilde{\mu}+\Delta_{\ell}\right],\quad\Delta_{\ell}=\sqrt{\frac{2^{\ell-1}}{2}\ln\frac{2n\log m}{\delta}}+\frac{2^{\ell-1}}{\sqrt{m}}.
4:- Let L:=\lceil\log m\rceil.
5:Initialize L+1 binary mechanisms, labelled {\rm BinMech}[0],\ldots,{\rm BinMech}[L], where {\rm BinMech}[\ell] is initialized with noise level \eta(m,n,\ell,\delta), where
\eta(m,n,\ell,\delta)=\frac{2\left(\sqrt{\frac{2^{\ell-1}}{2}\ln\frac{2n\log m}{\delta}}+\sqrt{2^{\ell-1}}\right)(1+\log n)}{\varepsilon/(L+1)}.
6:For each user u, let M(u)= no. of times user u has contributed a sample; initialize M(u)\leftarrow 0.
7:Initialize {\rm total}\leftarrow 0
8:for t=1,2,\ldots,T do
9:     M(u_{t})\leftarrow M(u_{t})+1
10:     if \log M(u_{t})\in\mathbb{Z}_{+} then
11:         \ell\leftarrow\log M(u_{t})
12:         {\rm total}\leftarrow{\rm total}+M(u_{t})
13:         if \ell=0 then \sigma\leftarrow x_{t}
14:         else if \ell\in\left\{1,2,3,\ldots\right\} then
15:              \sigma\leftarrow\Pi_{\ell}\left(\sum_{j=2^{\ell-1}+1}^{2^{\ell}}x^{(u_{t})}_{j}\right)
16:         end if
17:         {\rm BinMech}[\ell].{\rm Stream}\leftarrow{\rm BinMech}[\ell].{\rm Stream}\cup\left\{\sigma\right\}
18:     end if
19:     S_{t}\leftarrow\sum_{i=0}^{L}{\rm BinMech}[i].{\rm Sum}
20:     return \hat{\mu}_{t}=\frac{S_{t}}{\rm total}
21:end for

Note that

\Delta_{\ell}\leq\sqrt{\frac{2^{\ell-1}}{2}\ln\frac{2n\log m}{\delta}}+\sqrt{2^{\ell-1}}, (6)

where the inequality holds because m\geq 2^{\ell-1}. To ensure that Algorithm 5 is \varepsilon-DP, we will make each binary mechanism \frac{\varepsilon}{L+1}-DP. For any \ell\in\left\{0,\ldots,L\right\}, since each user contributes at most one element to {\rm BinMech}[\ell].{\rm Stream}, there are at most n elements in {\rm BinMech}[\ell].{\rm Stream} throughout the course of the algorithm. Moreover, since every element in {\rm BinMech}[\ell].{\rm Stream} has sensitivity at most 2\Delta_{\ell}, it suffices to add independent {\rm Lap}(\eta(m,n,\ell,\delta)) noise while computing each term in {\rm BinMech}[\ell].{\rm NoisyPartialSums}, where

\eta(m,n,\ell,\delta)=\frac{2\left(\sqrt{\frac{2^{\ell-1}}{2}\ln\frac{2n\log m}{\delta}}+\sqrt{2^{\ell-1}}\right)(1+\log n)}{\varepsilon/(L+1)}. (7)

Comparing \eta(m,n,\ell,\delta) above to \eta(m,n,\delta) in (5), we see that using multiple binary mechanisms allows us to add noise whose magnitude is fine-tuned to the exponential withhold-release pattern. In particular, if M_{t} is the maximum number of samples contributed by any user till time t, then at most \lceil\log M_{t}\rceil binary mechanisms are in use, and thus the maximum active noise level \eta(m,n,\ell,\delta) is proportional to \sqrt{M_{t}}. This gives us the following guarantee (proof in Appendix D):

Theorem 3.2.

Assume that we are given a user-level \varepsilon-DP prior estimate \tilde{\mu} of the true mean \mu, such that \left\lvert\tilde{\mu}-\mu\right\rvert\leq\frac{1}{\sqrt{m}}. Then, Algorithm 5 is 2\varepsilon-DP. Moreover, for any given t\in[T], we have with probability at least 1-\delta that

\left\lvert\hat{\mu}_{t}-\mu\right\rvert=\tilde{O}\left(\frac{1}{\sqrt{t}}+\frac{\sqrt{M_{t}}}{t\varepsilon}\right),

where M_{t} denotes the maximum number of samples obtained from any user till time t, i.e., M_{t}=\max\left\{m_{u}(t):u\in[n]\right\}, where m_{u}(t) is the number of samples obtained from user u till time t.

We now present our final algorithm, Algorithm 6, which does not assume a prior estimate. Observe that the prior estimate \tilde{\mu} was needed in the previous algorithms only to form the truncation intervals \mathcal{I}_{\ell} (see (4)). Algorithm 6 includes estimating \tilde{\mu} as a subroutine. In fact, as we describe next, the algorithm estimates a separate prior for each binary mechanism.

Continual mean estimation without assuming prior estimate: Algorithm 6.

Since the algorithm uses L+1 (where L:=\lceil\log m\rceil) binary mechanisms, we need not estimate \tilde{\mu} up to accuracy O(1/\sqrt{m}) in one go. Instead, we can have a separate prior \tilde{\mu}_{\ell} for each binary mechanism, which is helpful since {\rm BinMech}[\ell] requires \tilde{\mu}_{\ell} only up to an accuracy of O(1/\sqrt{2^{\ell-1}}) (see (6)).

In this algorithm, we mark {\rm BinMech}[\ell] as “inactive” till we have a sufficient number of users with a sufficient number of samples to estimate a user-level \frac{\varepsilon}{2L}-DP prior \tilde{\mu}_{\ell} up to an accuracy of \tilde{O}(1/\sqrt{2^{\ell-1}}). While {\rm BinMech}[\ell] is inactive, we store the elements that require \tilde{\mu}_{\ell} for truncation in {\rm Buffer}[\ell] (see Lines 28-29; each element of {\rm Buffer}[\ell] is a sum of 2^{\ell-1} samples, which we cannot truncate yet). Once we have a sufficient number of users with a sufficient number of samples (Line 12 lists the exact condition), we use a private median estimation algorithm (Algorithm 7) from [FS17] to estimate \tilde{\mu}_{\ell}. At this point, we use \tilde{\mu}_{\ell} to truncate the elements stored in {\rm Buffer}[\ell] and pass them to {\rm BinMech}[\ell].{\rm Stream} (Line 14).

Algorithm 6 Continual mean estimation: Full algorithm
1:\big((x_{t},u_{t})\big)_{t\in[T]} (stream), n (max no. of users), m (max no. of samples per user), \varepsilon (privacy parameter), \delta (failure probability)
2:Let x^{(u)}_{j} denote the j-th sample contributed by user u. Let L:=\lceil\log m\rceil. For \ell\geq 1, let \Pi_{\ell}(\cdot) be the projection onto the interval \mathcal{I}_{\ell} defined as
\mathcal{I}_{\ell}=\left[2^{\ell-1}\tilde{\mu}_{\ell}-\Delta_{\ell},\,2^{\ell-1}\tilde{\mu}_{\ell}+\Delta_{\ell}\right] (8)
where \tilde{\mu}_{\ell} is as in Line 13, and
\Delta_{\ell}=\sqrt{\frac{2^{\ell-1}}{2}\ln\frac{2n\log m}{\delta/3}}+\sqrt{2^{\ell}\ln\frac{2k(\frac{\varepsilon}{2L},\ell,\frac{\delta}{3L})}{\delta/3L}},\quad\text{where }k(\varepsilon^{\prime},\ell,\beta):=\frac{16}{\varepsilon^{\prime}}\ln\frac{2^{\ell/2}}{\beta}. (9)
3:Initialize L+1 binary mechanisms, labelled {\rm BinMech}[0],\ldots,{\rm BinMech}[L], where {\rm BinMech}[\ell] is initialized with noise level \eta(m,n,\ell,\delta) defined as
\eta(m,n,\ell,\delta)=\frac{2\Delta_{\ell}(1+\log n)}{\varepsilon/2(L+1)} (10)
4:Initialize {\rm Inactive}\leftarrow\left\{2,\ldots,L\right\}.
5:Initialize L-1 buffers labelled {\rm Buffer}[2],\ldots,{\rm Buffer}[L].
6:For each user u, let M(u)= no. of times user u has contributed a sample; initialize M(u)\leftarrow 0.
7:Initialize {\rm total}\leftarrow 0
8:for t=1,2,\ldots,T do
9:     M(u_{t})\leftarrow M(u_{t})+1
10:\% Activating binary mechanisms (if possible):
11:     for \ell\in{\rm Inactive} do
12:         if \sum_{u=1}^{n}\min\left\{M(u),2^{\ell-1}\right\}\geq 2^{\ell-1}\frac{16}{\varepsilon}\left(2L\ln\frac{3L\cdot 2^{\ell/2}}{\delta}\right) then
13:              \tilde{\mu}_{\ell}={\rm PrivateMedian}\left((x_{i},u_{i})_{i=1}^{t},\frac{\varepsilon}{2L},\ell,\frac{\delta}{3L}\right) \triangleright Algorithm 7
14:              {\rm BinMech}[\ell].{\rm Stream}\leftarrow\Pi_{\ell}\left({\rm Buffer}[\ell]\right)
15:              {\rm total}\leftarrow{\rm total}+2^{\ell-1}{\rm Size}\left({\rm Buffer}[\ell]\right); remove \ell from {\rm Inactive}
16:         end if
17:     end for
18:\%
19:     if \log M(u_{t})\in\mathbb{Z}_{+} then
20:         \ell\leftarrow\log M(u_{t})
21:         if \ell=0 then \sigma\leftarrow x_{t}
22:         else if \ell\in\left\{1,2,\ldots,L\right\} then
23:              \sigma\leftarrow\sum_{j=2^{\ell-1}+1}^{2^{\ell}}x^{(u_{t})}_{j}
24:         end if
25:         if \ell\notin{\rm Inactive} then
26:              {\rm BinMech}[\ell].{\rm Stream}\leftarrow{\rm BinMech}[\ell].{\rm Stream}\cup\left\{\Pi_{\ell}\left(\sigma\right)\right\}
27:              {\rm total}\leftarrow{\rm total}+M(u_{t})
28:         else if \ell\in{\rm Inactive} then
29:              {\rm Buffer}[\ell]\leftarrow{\rm Buffer}[\ell]\cup\left\{\sigma\right\}
30:         end if
31:     end if
32:     S_{t}\leftarrow\sum_{i=0}^{L}{\rm BinMech}[i].{\rm Sum}
33:     return \hat{\mu}_{t}=\frac{S_{t}}{\rm total}
34:end for

We note that the private median estimation algorithm is also used in [Lev+21] in the single-release setting, with all users contributing an equal number of samples; we extend it to the setting where the number of samples can vary across users. In Appendix E.1 (Claim E.1), we show that, for any \ell\geq 1, this modified private median estimation algorithm (Algorithm 7) is user-level \varepsilon-DP. Moreover, with probability at least 1-\delta-\beta, we have that \left\lvert\tilde{\mu}-\mu\right\rvert\leq 2\sqrt{\frac{1}{2^{\ell}}\ln\frac{2k(\varepsilon,\ell,\beta)}{\delta}}, where k(\varepsilon,\ell,\beta)=\frac{16}{\varepsilon}\ln\frac{2^{\ell/2}}{\beta}.

Algorithm 7 {\rm PrivateMedian} (subroutine for Line 13 in Algorithm 6)
1:(x_{i},u_{i})_{i=1}^{t}, \varepsilon (privacy), \ell (scale), \beta (failure probability)
2:Require: \sum_{u=1}^{n}\min\left\{m_{u}(t),2^{\ell-1}\right\}\geq 2^{\ell-1}\frac{16}{\varepsilon}\ln\frac{2^{\ell/2}}{\beta}, where m_{u}(t) is the no. of times user u occurs in (u_{i})_{i=1}^{t}.
3:
4:Initialize k:=\left(\frac{16}{\varepsilon}\ln\frac{2^{\ell/2}}{\beta}\right) arrays S_{1},\ldots,S_{k}, each of size 2^{\ell-1}.
5:\% Forming k arrays, each containing 2^{\ell-1} samples:
6:j\leftarrow 1.
7:for u\in[n] do
8:     Let r=\min\left\{m_{u}(t),2^{\ell-1}\right\}.
9:     One by one, start storing r samples from user u in array S_{j}.
10:     At any point, if array S_{j} becomes full, increment j: j\leftarrow j+1.
11:     Exit the loop once array S_{k} becomes full. \triangleright The ‘Require’ condition ensures that this eventually happens.
12:end for
13:\%
14:For j\in[k], let Y_{j} be the sample mean of all the samples in S_{j}.
15:
16:\% The steps below are as in [FS17],[Lev+21], mutatis mutandis:
17:Divide the interval [0,1] into disjoint subintervals (“bins”), each of length 2\cdot 2^{-\ell/2}. The last subinterval can be shorter if 1/(2\cdot 2^{-\ell/2}) is not an integer. Let \mathcal{T} be the set of midpoints of these subintervals.
18:For j\in[k], let Y^{\prime}_{j}=\arg\min_{y\in\mathcal{T}}\left\lvert Y_{j}-y\right\rvert be the point in \mathcal{T} closest to Y_{j}.
19:Define the cost function c:\mathcal{T}\to\mathbb{R} as
c(y):=\max\left\{\left\lvert\left\{j\in[k]:Y^{\prime}_{j}<y\right\}\right\rvert,\left\lvert\left\{j\in[k]:Y^{\prime}_{j}>y\right\}\right\rvert\right\}.
20:Let \tilde{\mu} be a sample drawn from the distribution satisfying
\mathrm{Pr}\left\{\tilde{\mu}=y\right\}\propto\exp\left(-\frac{\varepsilon}{4}c(y)\right).
\triangleright Note that we have -\frac{\varepsilon}{4}c(y) in \exp(\cdot), whereas [FS17],[Lev+21] had -\frac{\varepsilon}{2}c(y).
21:return \tilde{\mu}.

Algorithm 6 demonstrates another advantage of using multiple binary mechanisms: we can have different priors for different binary mechanisms, which means that the algorithm does not need to wait long before it starts outputting estimates with good guarantees. Theorem 3.4 (proof in Appendix E.2) states the exact guarantees ensured by Algorithm 6. Before stating the theorem, we define what we mean by the “diversity condition.”

Definition 3.3 (Diversity condition).

We say that the “diversity condition holds at time t” if

\sum_{u=1}^{n}\min\left\{m_{u}(t),\frac{M_{t}}{2}\right\}\geq\frac{M_{t}}{2}\cdot\frac{16}{\varepsilon}\left(2L\ln\frac{3L\sqrt{M_{t}}}{\delta}\right), (11)

where L:=\lceil\log m\rceil and M_{t} is the maximum number of samples contributed by any user till time t. That is, M_{t}:=\max\left\{m_{u}(t):u\in[n]\right\}, where m_{u}(t) is the number of samples contributed by user u till time t.

In words, condition (11) says that the number of users at time t must be large enough that we can form a collection of \frac{M_{t}}{2}\cdot\frac{16}{\varepsilon}\left(2L\ln\frac{3L\sqrt{M_{t}}}{\delta}\right) samples using at most \frac{M_{t}}{2} samples per user. A sufficient condition for (11) to hold is that there are at least \frac{16}{\varepsilon}\left(2L\ln\frac{3L\sqrt{M_{t}}}{\delta}\right) users that have contributed at least \frac{M_{t}}{2} samples each till time t. In particular, since M_{t}\leq m, we have the following: once there are \frac{16}{\varepsilon}\left(2L\ln\frac{3L\sqrt{m}}{\delta}\right) users that have contributed at least \frac{m}{2} samples each till some time t_{0}, the diversity condition holds for every t\geq t_{0}.
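
The condition is a simple check on the per-user contribution counts; a minimal Python sketch (our own) follows:

import math

def diversity_holds(counts, m, eps, delta):
    # counts: list with m_u(t) for each user u at time t (Definition 3.3)
    M_t = max(counts)
    L = math.ceil(math.log2(m))
    lhs = sum(min(c, M_t / 2) for c in counts)
    rhs = (M_t / 2) * (16 / eps) * (2 * L * math.log(3 * L * math.sqrt(M_t) / delta))
    return lhs >= rhs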

Theorem 3.4.

Algorithm 6 for continual Bernoulli mean estimation is user-level \varepsilon-DP. Moreover, if at time t\in[T] the diversity condition (11) holds, then, with probability at least 1-\delta (for arbitrary \delta\in(0,1]),

\left\lvert\hat{\mu}_{t}-\mu\right\rvert=\tilde{O}\left(\frac{1}{\sqrt{t}}+\frac{\sqrt{M_{t}}}{t\varepsilon}\right).

What happens at time instants when the diversity condition does not hold? This happens when very few users have contributed a very large number of samples. Algorithm 6 stores these samples in buffers (since the corresponding binary mechanisms are “inactive”) and does not use them to estimate \hat{\mu}_{t}. This is done to preserve user-level privacy and seems necessary. However, we currently do not know whether there is a user-level private way to use these extra samples from a few users. In other words, it is not clear whether our diversity condition can be weakened.

4 Extensions

To perform mean estimation on d-dimensional inputs with independent coordinates, each drawn from \text{Ber}(\mu), one can simply run Algorithm 6 on each coordinate. Since the release corresponding to each coordinate is user-level \varepsilon-DP, the overall algorithm is d\varepsilon-DP by basic composition. However, if we only require an approximate differential privacy guarantee, then [DRV10, Theorem III.3] shows that the full sequence of releases from all coordinates is (\varepsilon\sqrt{d\ln(1/\tilde{\delta})},\tilde{\delta})-DP for every \tilde{\delta}\in(0,1]. Rescaling the privacy parameter to ensure (\varepsilon,\tilde{\delta})-DP overall gives error \tilde{O}(1/\sqrt{t}+\sqrt{dM_{t}}/t\varepsilon), provided \sum_{u=1}^{n}\min\{m_{u}(t),M_{t}/2\}\geq\tilde{O}(\sqrt{dM_{t}}/\varepsilon). These arguments carry over to the case of subgaussian distributions as well.

5 Lower Bound

Consider the single-release mean estimation problem with n users, each having m samples, where the estimated mean must be user-level \varepsilon-DP. In this setting, a lower bound of \Omega\left(\frac{1}{\sqrt{mn}}+\frac{1}{\sqrt{m}n\varepsilon}\right) on the achievable accuracy is known (Theorem 3 in [Liu+20]). Furthermore, in the same setting, Theorem 9 in [Lev+21] shows that any algorithm that is user-level \varepsilon-DP requires n=\Omega\left(\frac{1}{\varepsilon}\right) users. To see what these lower bounds say about our proposed continual-release algorithm (Algorithm 6), let t be a time instant at which N users have contributed M samples each (thus, t=NM). In this case, M_{t}=M, and Theorem 3.4 gives an accuracy guarantee of \tilde{O}\left(\frac{1}{\sqrt{MN}}+\frac{1}{\sqrt{M}N\varepsilon}\right), provided N=\tilde{\Omega}\left(\frac{1}{\varepsilon}\right) (the diversity condition). This matches the single-release lower bounds (on both accuracy and the number of users) up to log factors.

6 Discussion

We have shown that Algorithm 6 is almost optimal at every time instant where users have contributed equal numbers of samples. However, what about settings where different users contribute different numbers of samples? Is it optimal, for instance, to not use the excess samples from a single user? The answer is not clear even in the single-release setting. Investigating this is an interesting direction for future work.

References

  • [NS08] Arvind Narayanan and Vitaly Shmatikov “Robust De-anonymization of Large Sparse Datasets” In 2008 IEEE Symposium on Security and Privacy, 2008, pp. 111–125
  • [SAW13] Latanya Sweeney, Akua Abu and Julia Winn “Identifying Participants in the Personal Genome Project by Name” In Data Privacy Lab, IQSS, Harvard University, 2013 URL: http://dataprivacylab.org/projects/pgp/
  • [GAM19] Simson L. Garfinkel, John M. Abowd and Chris Martindale “Understanding database reconstruction attacks on public data” In Communications of the ACM 62, 2019, pp. 46–53
  • [Dwo+06] Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam D. Smith “Calibrating Noise to Sensitivity in Private Data Analysis” In TCC 3876, Lecture Notes in Computer Science Springer, 2006, pp. 265–284
  • [Ami+19] Kareem Amin, Alex Kulesza, Andres Muñoz Medina and Sergei Vassilvitskii “Bounding User Contributions: A Bias-Variance Trade-off in Differential Privacy” In Proceedings of the 36th International Conference on Machine Learning PMLR, 2019, pp. 263–271
  • [CSS10] T.-H. Chan, Elaine Shi and Dawn Song “Private and Continual Release of Statistics” In ICALP (2) 6199, Lecture Notes in Computer Science Springer, 2010, pp. 405–417
  • [Dwo+10] Cynthia Dwork, Moni Naor, Toniann Pitassi and Guy N. Rothblum “Differential Privacy under Continual Observation” In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, STOC ’10 Cambridge, Massachusetts, USA: Association for Computing Machinery, 2010, pp. 715–724
  • [Lev+21] Daniel Levy et al. “Learning with User-Level Privacy” In Advances in Neural Information Processing Systems 34 Curran Associates, Inc., 2021, pp. 12466–12479
  • [DRV10] Cynthia Dwork, Guy N. Rothblum and Salil Vadhan “Boosting and Differential Privacy”, FOCS ’10 IEEE Computer Society, 2010, pp. 51–60 URL: https://doi.org/10.1109/FOCS.2010.12
  • [Jai+21] Palak Jain, Sofya Raskhodnikova, Satchit Sivakumar and Adam D. Smith “The Price of Differential Privacy under Continual Observation” In CoRR abs/2112.00828, 2021 arXiv: https://arxiv.org/abs/2112.00828
  • [Nik13] Aleksandar Nikolov “Differential Privacy in the Streaming World”, Simons Institute Workshop on Big Data and Differential Privacy, 2013
  • [Cha+12] T.-H. Chan, Mingfei Li, Elaine Shi and Wenchang Xu “Differentially Private Continual Monitoring of Heavy Hitters from Distributed Streams” In Proceedings of the 12th International Conference on Privacy Enhancing Technologies, PETS’12 Vigo, Spain: Springer-Verlag, 2012, pp. 140–159
  • [Bol+13] Jean Bolot et al. “Private Decayed Predicate Sums on Streams” In Proceedings of the 16th International Conference on Database Theory, ICDT ’13 Genoa, Italy: Association for Computing Machinery, 2013, pp. 284–295
  • [DKY17] Bolin Ding, Janardhan Kulkarni and Sergey Yekhanin “Collecting Telemetry Data Privately” In Advances in Neural Information Processing Systems 30 Curran Associates, Inc., 2017
  • [Jos+18] Matthew Joseph, Aaron Roth, Jonathan Ullman and Bo Waggoner “Local Differential Privacy for Evolving Data” In Proceedings of the 32nd International Conference on Neural Information Processing Systems Montréal, Canada: Curran Associates Inc., 2018, pp. 2381–2390
  • [PAK19] Victor Perrier, Hassan Jameel Asghar and Dali Kaafar “Private Continual Release of Real-Valued Data Streams” In 26th Annual Network and Distributed System Security Symposium, NDSS 2019, San Diego, California, USA, February 24-27, 2019 The Internet Society, 2019
  • [Dwo+10a] Cynthia Dwork et al. “Pan-Private Streaming Algorithms” In ICS Tsinghua University Press, 2010, pp. 66–80
  • [Cum+21] Rachel Cummings, Vitaly Feldman, Audra McMillan and Kunal Talwar “Mean Estimation with User-level Privacy under Data Heterogeneity” In NeurIPS 2021 Workshop Privacy in Machine Learning, 2021 URL: https://openreview.net/forum?id=oYbQDV3mon-
  • [NME22] Shyam Narayanan, Vahab Mirrokni and Hossein Esfandiari “Tight and Robust Private Mean Estimation with Few Users” In Proceedings of the 39th International Conference on Machine Learning 162, Proceedings of Machine Learning Research PMLR, 2022, pp. 16383–16412
  • [FS17] Vitaly Feldman and Thomas Steinke “Generalization for Adaptively-chosen Estimators via Stable Median” In Proceedings of the 30th Conference on Learning Theory, COLT 2017, Amsterdam, The Netherlands, 7-10 July 2017 65, Proceedings of Machine Learning Research PMLR, 2017, pp. 728–757
  • [Liu+20] Yuhan Liu et al. “Learning discrete distributions: user vs item-level privacy” In Advances in Neural Information Processing Systems 33 Curran Associates, Inc., 2020, pp. 20965–20976 URL: https://proceedings.neurips.cc/paper/2020/file/f06edc8ab534b2c7ecbd4c2051d9cb1e-Paper.pdf
  • [DR+14] Cynthia Dwork and Aaron Roth “The algorithmic foundations of differential privacy” In Foundations and Trends® in Theoretical Computer Science 9.3–4 Now Publishers, Inc., 2014, pp. 211–407

Appendix A Useful Inequalities

We state two concentration inequalities that we will use extensively.

Lemma A.1.

Let xiiidBer(μ)x_{i}\stackrel{{\scriptstyle iid}}{{\sim}}\text{Ber}(\mu) for i[n]i\in[n]. Then, for every δ(0,1]\delta\in(0,1],

(|i=1nxinμ|n2ln2δ)δ.\displaystyle\mathbb{P}\bigg{(}\bigg{\lvert}\sum_{i=1}^{n}x_{i}-n\mu\bigg{\rvert}\geq\sqrt{\frac{n}{2}\ln\frac{2}{\delta}}\bigg{)}\leq\delta.
Lemma A.2.

Let xiLap(bi)x_{i}\sim\text{Lap}(b_{i}), i[n]i\in[n], be independent. Then, for every δ(0,1]\delta\in(0,1],

(|i=1nxi|ci=1nbi2ln1δ)δ,\displaystyle\mathbb{P}\bigg{(}\bigg{\lvert}\sum_{i=1}^{n}x_{i}\bigg{\rvert}\geq c\sqrt{\sum_{i=1}^{n}b_{i}^{2}}\ln\frac{1}{\delta}\bigg{)}\leq\delta,

where cc is an absolute constant.
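As an empirical sanity check (ours, with illustrative parameter values), the following short simulation confirms that the deviation in Lemma A.1 exceeds the stated bound with frequency at most $\delta$; the sum of $n$ i.i.d. $\text{Ber}(\mu)$ samples is drawn directly as a Binomial. (Lemma A.2 involves an unspecified absolute constant $c$, so we only check Lemma A.1.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters, not from the paper.
n, mu, delta, trials = 1000, 0.3, 0.05, 100000
sums = rng.binomial(n, mu, size=trials)       # sum of n Ber(mu) samples
bound = np.sqrt(n / 2 * np.log(2 / delta))    # bound from Lemma A.1
rate = np.mean(np.abs(sums - n * mu) >= bound)
print(f"empirical violation rate {rate:.4f} <= delta = {delta}")
```

Hoeffding's inequality is not tight, so the empirical violation rate is typically far below $\delta$.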

Appendix B Continual mean estimation: Wishful scenario

Algorithm 3 is the algorithm under the wishful scenario where:

  • we already have a prior estimate μ~\tilde{\mu} that satisfies |μ~μ|1m\left\lvert\tilde{\mu}-\mu\right\rvert\leq\frac{1}{\sqrt{m}};

  • every user contributes their mm samples in contiguous time steps; that is, user 11 contributes samples for the first mm time steps, followed by user 2 who contributes samples for the next mm time steps, and so on.

B.1 Guarantees for Algorithm 3

Let xj(u)x^{(u)}_{j} denote the jj-th sample contributed by user uu.

Privacy.

A user $u$ contributes at most one element to ${\rm BinMech.Stream}$; this element is $\Pi\left(\sum_{j=1}^{m}x_j^{(u)}\right)$, where $\Pi(\cdot)$ is the projection onto the interval $\mathcal{I}$ defined in (2). So, there are at most $n$ elements in ${\rm BinMech.Stream}$ throughout the course of the algorithm. From the way the binary mechanism works, a given element in ${\rm BinMech.Stream}$ is used at most $(1+\log n)$ times while computing terms in ${\rm BinMech.NoisyPartialSums}$ (see Section 2.3 in the main paper). Thus, changing a user can change the $\ell_1$-norm of the array ${\rm BinMech.NoisyPartialSums}$ by at most $(1+\log n)(2\Delta)$, where $\Delta$ is as in (2). Hence, adding independent ${\rm Lap}(\eta)$ noise (with $\eta$ as in (3)) while computing each term in ${\rm BinMech.NoisyPartialSums}$ is sufficient to ensure that the array ${\rm BinMech.NoisyPartialSums}$ remains user-level $\varepsilon$-DP throughout the course of the algorithm. Since the output $\left(\hat{\mu}_t\right)_{t=1}^{T}$ is computed using ${\rm BinMech.Sum}$, which, in turn, is a function of the array ${\rm BinMech.NoisyPartialSums}$, we conclude that Algorithm 3 is user-level $\varepsilon$-DP.
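For intuition, here is a minimal sketch of such a binary mechanism; the interface names (${\rm Stream}$, ${\rm NoisyPartialSums}$, ${\rm Sum}$) follow the paper, while the bookkeeping details and data types are our illustrative choices:

```python
import numpy as np

class BinMech:
    """Sketch of the binary mechanism for private prefix sums: each
    appended element enters at most 1 + log k dyadic blocks, and any
    prefix sum is assembled from at most 1 + log k noisy blocks."""

    def __init__(self, eta, rng=None):
        self.eta = eta                 # Laplace scale per noisy block
        self.stream = []               # BinMech.Stream
        self.noisy_partial_sums = {}   # BinMech.NoisyPartialSums
        self.rng = rng or np.random.default_rng()

    def append(self, value):
        self.stream.append(value)
        k = len(self.stream)
        # Store a noisy sum for every dyadic block that ends at k.
        length = 1
        while k % length == 0:
            start = k - length
            self.noisy_partial_sums[(start, length)] = (
                sum(self.stream[start:k]) + self.rng.laplace(scale=self.eta)
            )
            length *= 2

    def sum(self):
        # BinMech.Sum: cover the first k elements with stored blocks
        # given by the binary expansion of k.
        k, pos, total = len(self.stream), 0, 0.0
        for j in reversed(range(k.bit_length())):
            if k & (1 << j):
                total += self.noisy_partial_sums[(pos, 1 << j)]
                pos += 1 << j
        return total
```

In Algorithm 3, each user's projected sum $\Pi\left(\sum_{j=1}^{m}x_j^{(u)}\right)$ would be appended once, and $\hat{\mu}_t$ read off from ${\rm sum()}$ after rescaling.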

Utility.

We first show that, with high probability, the projection operator $\Pi(\cdot)$ plays no role throughout the algorithm. We then show the utility guarantee for $t=km$ assuming no truncation, before generalizing it to arbitrary $t$.

No truncation happens:

For a user uu, we have from Lemma A.1 that

Pr(|j=1mxj(u)mμ|m2ln2nδ)1δn.\mathrm{Pr}\left(\left\lvert\sum_{j=1}^{m}x_{j}^{(u)}-m\mu\right\rvert\leq\sqrt{\frac{m}{2}\ln\frac{2n}{\delta}}\right)\geq 1-\frac{\delta}{n}.

Since |μ~μ|1m\left\lvert\tilde{\mu}-\mu\right\rvert\leq\frac{1}{\sqrt{m}}, this gives us

u[n],Pr(|j=1mxj(u)mμ~|m2ln2nδ+m)1δn.\forall u\in[n],\ \mathrm{Pr}\left(\left\lvert\sum_{j=1}^{m}x_{j}^{(u)}-m\tilde{\mu}\right\rvert\leq\sqrt{\frac{m}{2}\ln\frac{2n}{\delta}}+\sqrt{m}\right)\geq 1-\frac{\delta}{n}.

Thus, by union bound

Pr(u[n],|j=1mxj(u)mμ~|m2ln2nδ+m)1δ.\mathrm{Pr}\left(\forall u\in[n],\left\lvert\sum_{j=1}^{m}x_{j}^{(u)}-m\tilde{\mu}\right\rvert\leq\sqrt{\frac{m}{2}\ln\frac{2n}{\delta}}+\sqrt{m}\right)\geq 1-\delta.

This means that, with probability at least $1-\delta$, we have $\Pi\left(\sum_{j=1}^{m}x_j^{(u)}\right)=\sum_{j=1}^{m}x_j^{(u)}$ for every $u\in[n]$. That is, with probability at least $1-\delta$, no truncation happens throughout the course of the algorithm.

Guarantee at t=kmt=km ignoring projection Π()\Pi(\cdot):

For now, consider Algorithm 3 without the projection operator $\Pi(\cdot)$ in Line 9; call it Algorithm 3-NP (NP $\equiv$ ‘No Projection’). For Algorithm 3-NP, the following holds at $t=km$, for integer $1\leq k\leq n$: ${\rm BinMech.Stream}$ contains the $k$ elements $\sigma_1=\left(\sum_{j=1}^{m}x_j^{(1)}\right),\sigma_2=\left(\sum_{j=1}^{m}x_j^{(2)}\right),\ldots,\sigma_k=\left(\sum_{j=1}^{m}x_j^{(k)}\right)$. Thus, ${\rm BinMech.NoisyPartialSums}$ also contains $k$ terms, each with independent ${\rm Lap}(\eta)$ noise added to it. Hence, ${\rm BinMech.Sum}$ is computed using at most $1+\log k$ terms from ${\rm BinMech.NoisyPartialSums}$, and so, using Lemma A.2, we have

Pr(|Skmi=1kσi|cη1+logkln1δ)1δ.\mathrm{Pr}\left(\left\lvert S_{km}-\sum_{i=1}^{k}\sigma_{i}\right\rvert\leq c\eta\sqrt{1+\log k}\ln\frac{1}{\delta}\right)\geq 1-\delta.

Furthermore, using Lemma A.1, we have for t=kmt=km that

Pr(|i=1kσiμkm|km2ln2δ)1δ.\mathrm{Pr}\left(\left\lvert\sum_{i=1}^{k}\sigma_{i}-\mu km\right\rvert\leq\sqrt{\frac{km}{2}\ln\frac{2}{\delta}}\right)\geq 1-\delta.

Thus, we have the following at t=kmt=km:

Pr(|Skmμkm|km2ln2δ+cη1+logkln1δ)12δ\mathrm{Pr}\left(\left\lvert S_{km}-\mu km\right\rvert\leq\sqrt{\frac{km}{2}\ln\frac{2}{\delta}}+c\eta\sqrt{1+\log k}\ln\frac{1}{\delta}\right)\geq 1-2\delta

or, dividing by kmkm, we have

Pr(|μ^kmμ|12kmln2δ+cηkm1+logkln1δ)12δ.\mathrm{Pr}\left(\left\lvert\hat{\mu}_{km}-\mu\right\rvert\leq\sqrt{\frac{1}{2km}\ln\frac{2}{\delta}}+c\frac{\eta}{km}\sqrt{1+\log k}\ln\frac{1}{\delta}\right)\geq 1-2\delta. (12)

Guarantee at arbitrary tmt\geq m ignoring projection Π\Pi:

We now give the utility guarantee of Algorithm 3-NP for arbitrary $t\geq m$. Note that for any $t\in[km,(k+1)m-1)$, $k\geq 1$, we output $\hat{\mu}_t=\hat{\mu}_{km}$. Moreover, we always have $km\geq\frac{t}{2}$. Thus, for $t\in[km,(k+1)m-1)$, we have, with probability at least $1-2\delta$,

|μ^tμ|\displaystyle\left\lvert\hat{\mu}_{t}-\mu\right\rvert =|μ^kmμ|\displaystyle=\left\lvert\hat{\mu}_{km}-\mu\right\rvert
12kmln2δ+cηkm1+logkln1δ\displaystyle\leq\sqrt{\frac{1}{2km}\ln\frac{2}{\delta}}+c\frac{\eta}{km}\sqrt{1+\log k}\ln\frac{1}{\delta}
1tln2δ+c2ηt1+logkln1δ.\displaystyle\leq\sqrt{\frac{1}{t}\ln\frac{2}{\delta}}+c\frac{2\eta}{t}\sqrt{1+\log k}\ln\frac{1}{\delta}. (since kmt2km\geq\frac{t}{2})

Final utility guarantee for Algorithm 3 at arbitrary tt:

The above utility guarantee was obtained for Algorithm 3-NP, the variant of Algorithm 3 without the projection operator $\Pi(\cdot)$ in Line 9. Now, let $\mathcal{E}_1$ be the event that no truncation happens. We already saw that no truncation happens (i.e., the projection operator plays no role) with probability at least $1-\delta$. That is, $\mathrm{Pr}(\mathcal{E}_1)\geq 1-\delta$. Observe that

\mathcal{E}_{1}:=\left\{\text{no truncation happens}\right\}=\left\{\text{Algorithm 3 and Algorithm 3-NP become equivalent}\right\}.

Let

\mathcal{E}_{2}=\left\{\left\lvert\hat{\mu}_{t}-\mu\right\rvert\leq\sqrt{\frac{1}{t}\ln\frac{2}{\delta}}+c\frac{2\eta}{t}\sqrt{1+\log k}\ln\frac{1}{\delta}\text{ for Algorithm 3-NP}\right\}

We saw above that $\mathrm{Pr}(\mathcal{E}_2)\geq 1-2\delta$. Thus, by the union bound, $\mathrm{Pr}(\mathcal{E}_1\cap\mathcal{E}_2)\geq 1-3\delta$. That is, for Algorithm 3, we have for $t\geq m$ (upper bounding $\log k$ by $\log n$) that

Pr(|μ^tμ|1tln2δ+c2ηt1+lognln1δ)13δ.\mathrm{Pr}\left(\left\lvert\hat{\mu}_{t}-\mu\right\rvert\leq\sqrt{\frac{1}{t}\ln\frac{2}{\delta}}+c\frac{2\eta}{t}\sqrt{1+\log n}\ln\frac{1}{\delta}\right)\geq 1-3\delta.

Substituting the value of $\eta$ from (3), we get that for $t\geq m$, with probability at least $1-3\delta$, we have

|μ^tμ|\displaystyle\left\lvert\hat{\mu}_{t}-\mu\right\rvert 1tln2δ+mtε(1+12ln2nδ)4c(1+logn)3/2ln1δ\displaystyle\leq\sqrt{\frac{1}{t}\ln\frac{2}{\delta}}+\frac{\sqrt{m}}{t\varepsilon}\left(1+\sqrt{\frac{1}{2}\ln\frac{2n}{\delta}}\right){4c(1+\log n)^{3/2}}\ln\frac{1}{\delta}
=O~(1t+mtε).\displaystyle=\tilde{O}\left(\frac{1}{\sqrt{t}}+\frac{\sqrt{m}}{t\varepsilon}\right).

This guarantee holds trivially for t<mt<m as well because the algorithm outputs the prior estimate μ~\tilde{\mu} (Lines 3-5), which gives us that |μ^tμ|=|μ~μ|1m1t\left\lvert\hat{\mu}_{t}-\mu\right\rvert=\left\lvert\tilde{\mu}-\mu\right\rvert\leq\frac{1}{\sqrt{m}}\leq\frac{1}{\sqrt{t}}.

Remark: Throughout this section, we assumed that we were given a prior estimate $\tilde{\mu}$ satisfying $\left\lvert\tilde{\mu}-\mu\right\rvert\leq\frac{1}{\sqrt{m}}$. Our discussion also applies when the prior satisfies $\left\lvert\tilde{\mu}-\mu\right\rvert\leq\frac{1}{\sqrt{m}}$ only with probability at least $1-\delta$ (instead of deterministically). Also, if $\tilde{\mu}$ is computed using samples from users, it should itself be user-level DP. In particular, if $\tilde{\mu}$ is user-level $\varepsilon$-DP, the overall algorithm becomes user-level $2\varepsilon$-DP (using the composition property of DP from Lemma 2.2).

Appendix C Continual mean estimation assuming prior estimate: Single binary mechanism

Before proving the guarantees for Algorithm 4 (Theorem 3.1), we prove a claim about the exponential withhold-release pattern under an arbitrary user ordering.

C.1 Exponential withhold-release

Recall the exponential withhold-release pattern. For a given user $u$, we release the first two samples $x_1^{(u)},x_2^{(u)}$; then, we withhold $x_3^{(u)}$ and release a truncated version of $(x_3^{(u)}+x_4^{(u)})$ when we receive $x_4^{(u)}$; we then withhold $x_5^{(u)},x_6^{(u)},x_7^{(u)}$ and release a truncated version of $(x_5^{(u)}+x_6^{(u)}+x_7^{(u)}+x_8^{(u)})$ when we receive $x_8^{(u)}$; and so on. In general, we withhold samples $x_{2^{\ell-1}+1}^{(u)},\ldots,x_{2^{\ell}-1}^{(u)}$ and release a truncated version of $\left(\sum_{i=2^{\ell-1}+1}^{2^{\ell}}x_i^{(u)}\right)$ when we receive the $2^{\ell}$-th sample $x_{2^{\ell}}^{(u)}$ from user $u$.

We ignore truncations for now. For a user uu, let σ0(u)=x1(u),\sigma_{0}^{(u)}=x_{1}^{(u)}, σ1(u)=x2(u),\sigma_{1}^{(u)}=x_{2}^{(u)}, σ2(u)=(x3(u)+x4(u)),,\sigma_{2}^{(u)}=(x_{3}^{(u)}+x_{4}^{(u)}),\ldots, σ(u)=(i=21+12xi(u)),\sigma_{\ell}^{(u)}~{}=~{}\left(\sum_{i=2^{\ell-1}+1}^{2^{\ell}}x_{i}^{(u)}\right),\ldots. Let ((xt,ut))t[T]\big{(}(x_{t},u_{t})\big{)}_{t\in[T]} be a stream with arbitrary user order. We follow the exponential withhold-release protocol and feed σ(u)\sigma_{\ell}^{(u)}’s to an array named Stream{\rm Stream}. For instance, suppose we receive (x1,1),(x2,2),(x3,2),(x4,2),(x5,1),(x6,2),(x7,1),(x8,1)(x_{1},1),(x_{2},2),(x_{3},2),(x_{4},2),(x_{5},1),(x_{6},2),(x_{7},1),(x_{8},1), where the second index denotes the user identity. Then, the input feed to Stream{\rm Stream} looks as follows:

t=1:x1(1st sample from user 1)t=2:x2(1st sample from user 2)t=3:x3(2nd sample from user 2)t=4:𝚠𝚒𝚝𝚑𝚑𝚘𝚕𝚍(3rd sample from user 2)t=5:x5(2nd sample from user 1)t=6:x4+x6(4th sample from user 2)t=7:𝚠𝚒𝚝𝚑𝚑𝚘𝚕𝚍(3rd sample from user 1)t=8:x7+x8(4th sample from user 1)\begin{matrix}t=1:&x_{1}&(\text{1st sample from user }1)\\ t=2:&x_{2}&(\text{1st sample from user }2)\\ t=3:&x_{3}&(\text{2nd sample from user }2)\\ t=4:&{\tt withhold}&(\text{3rd sample from user }2)\\ t=5:&x_{5}&(\text{2nd sample from user }1)\\ t=6:&x_{4}+x_{6}&(\text{4th sample from user }2)\\ t=7:&{\tt withhold}&(\text{3rd sample from user }1)\\ t=8:&x_{7}+x_{8}&(\text{4th sample from user }1)\end{matrix} (13)

Since we are only interested in computing sums, feeding $\sigma_\ell^{(u)}=\left(\sum_{i=2^{\ell-1}+1}^{2^{\ell}}x_i^{(u)}\right)$ to ${\rm Stream}$ is equivalent to feeding information about $2^{\ell-1}$ samples to ${\rm Stream}$. A short sketch of this schedule is given below, after which we state our claim.
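Here is a minimal sketch of this schedule (ours; truncation via $\Pi_\ell(\cdot)$ is omitted). On the eight-element stream above it reproduces exactly the feed in (13):

```python
from collections import defaultdict

def withhold_release(stream):
    """Feed (sample, user) pairs; release the user's pending block sum
    whenever that user's sample count reaches a power of two, and
    withhold otherwise."""
    count = defaultdict(int)    # samples seen per user
    block = defaultdict(float)  # withheld partial sum per user
    out = []
    for x, u in stream:
        count[u] += 1
        block[u] += x
        c = count[u]
        if c & (c - 1) == 0:            # c in {1, 2, 4, 8, ...}: release
            out.append((u, block[u]))   # this sigma_l^{(u)} enters Stream
            block[u] = 0.0
        else:
            out.append((u, None))       # withhold
    return out
```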

Claim C.1.

Let ((xt,ut))t[T]\big{(}(x_{t},u_{t})\big{)}_{t\in[T]} have arbitrary user ordering. Suppose we follow the exponential withhold-release protocol and feed σ(u)\sigma_{\ell}^{(u)}’s to an array named Stream{\rm Stream}. Then, at any time tt, Stream{\rm Stream} contains information about at least t/2t/2 samples.

Proof.

Let ‘R’ denote ‘release’ and ‘W’ denote ‘withhold’. Suppose, for now, that only user $u$ arrives at all time steps. Then, the withhold-release sequence looks like $(R_u,R_u,W_u,R_u,3W_u,R_u,7W_u,R_u,\cdots)$, where ‘$kW_u$’ denotes that ‘$W_u$’ occurs for the next $k$ steps in the sequence. Moreover, at every ‘$R_u$’, information about the samples withheld after the previous ‘$R_u$’ is released. Note that, at any point along this withhold-release sequence, the number of samples withheld is at most the number of samples whose information has been released. This is for a given user $u$.

Now, consider a withhold-release sequence induced by an arbitrary user ordering. E.g., if the user order is $1,2,2,2,1,2,1,1$, then the corresponding withhold-release sequence is $(R_1,R_2,R_2,W_2,R_1,R_2,W_1,R_1)$ (see (13)), where the subscript denotes the user ID. At any time $t$ in this withhold-release sequence, if we consider only the $R_u$'s and $W_u$'s up to time $t$ for a fixed user $u$, then, as argued above, the number of samples withheld for user $u$ is at most the number of samples from user $u$ whose information has been released. This holds for every user who has appeared till time $t$. Thus, at any time $t$, the total number of samples withheld (across all users) is at most the total number of samples (across all users) whose information has been released. Hence, ${\rm Stream}$ contains information about at least $t/2$ samples. ∎
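As an empirical companion to Claim C.1 (ours, on a random ordering with illustrative parameters), the schedule indeed covers at least $t/2$ samples at every time step:

```python
import random
from collections import defaultdict

random.seed(0)
users = [random.randrange(5) for _ in range(500)]  # arbitrary ordering
count, covered = defaultdict(int), 0
for t, u in enumerate(users, start=1):
    count[u] += 1
    c = count[u]
    if c & (c - 1) == 0:                    # release at counts 1, 2, 4, ...
        covered += 1 if c == 1 else c // 2  # samples this release covers
    assert 2 * covered >= t                 # Claim C.1
print("Claim C.1 held at all", len(users), "time steps")
```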

C.2 Proof of Theorem 3.1

We will prove the following theorem.

Theorem.

Assume that we are given a user-level $\varepsilon$-DP prior estimate $\tilde{\mu}$ of the true mean $\mu$ such that $\left\lvert\tilde{\mu}-\mu\right\rvert\leq\frac{1}{\sqrt{m}}$. Then, Algorithm 4 is user-level $2\varepsilon$-DP. Moreover, for any given $t\in[T]$, we have with probability at least $1-\delta$ that

|μ^tμ|=O~(1t+mtε).\displaystyle\left\lvert\hat{\mu}_{t}-\mu\right\rvert=\tilde{O}\left(\frac{1}{\sqrt{t}}+\frac{\sqrt{m}}{t\varepsilon}\right).
Proof.

We first prove the privacy guarantee and then the utility guarantee.

Privacy.

The prior estimate $\tilde{\mu}$ is given to be user-level $\varepsilon$-DP. We will now show that the array ${\rm BinMech.NoisyPartialSums}$ is user-level $\varepsilon$-DP throughout the course of the algorithm. Note that the output estimates $\hat{\mu}_t$ are computed using ${\rm BinMech.Sum}$, which is a function of ${\rm BinMech.NoisyPartialSums}$. Thus, if $\tilde{\mu}$ and ${\rm BinMech.NoisyPartialSums}$ are both user-level $\varepsilon$-DP, then by the composition property (Lemma 2.2), the overall output $\left(\hat{\mu}_t\right)_{t=1}^{T}$ will be user-level $2\varepsilon$-DP.

Proof that BinMech.NoisyPartialSums{\rm BinMech.NoisyPartialSums} is user-level ε\varepsilon-DP:

Since a user uu contributes at most mm samples, and does so in an exponential withhold-release pattern, there are at most (1+logm)(1+\log m) elements in BinMech.Stream{\rm BinMech.Stream} corresponding to user uu. Since there are at most nn users, there are at most n(1+logm)n(1+\log m) elements added to BinMech.Stream{\rm BinMech.Stream} throughout the course of the algorithm.

Now, since there can be at most n(1+logm)n(1+\log m) elements in BinMech.Stream{\rm BinMech.Stream}, a given element in BinMech.Stream{\rm BinMech.Stream} will be used at most log(1+n(1+logm))\log(1+n(1+\log m)) times while computing BinMech.NoisyPartialSums{\rm BinMech.NoisyPartialSums}. Thus, changing a user can change the 1\ell_{1}-norm of BinMech.NoisyPartialSums{\rm BinMech.NoisyPartialSums} by at most (1+logm)(log(1+n(1+logm)))Δ(1+\log m)(\log(1+n(1+\log m)))\Delta, where Δ\Delta is the maximum sensitivity of an element contributed by a user to BinMech.Stream{\rm BinMech.Stream}. As can be seen from (4), we have Δ(m2ln2nlogmδ+m)\Delta_{\ell}\leq\left(\sqrt{\frac{m}{2}\ln\frac{2n\log m}{\delta}}+\sqrt{m}\right) for every \ell. Thus, Δ=2(m2ln2nlogmδ+m)\Delta=2\left(\sqrt{\frac{m}{2}\ln\frac{2n\log m}{\delta}}+\sqrt{m}\right) is an upper bound on worst-case sensitivity of an element contributed by a user to BinMech.Stream{\rm BinMech.Stream}. Hence, adding independent Lap(η(m,n,δ)){\rm Lap}(\eta(m,n,\delta)) noise (with η(m,n,δ)\eta(m,n,\delta) as in (5)) while computing each term in BinMech.NoisyPartialSums{\rm BinMech.NoisyPartialSums} is sufficient to ensure that BinMech.NoisyPartialSums{\rm BinMech.NoisyPartialSums} is user-level ε\varepsilon-DP.
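For concreteness, this calibration can be transcribed as follows (a sketch of the accounting above; the paper's exact expression (5) may differ in constants and rounding):

```python
import math

def laplace_scale(m, n, eps, delta):
    """Laplace scale implied by the sensitivity argument above: each
    user touches at most (1 + log m) stream elements, each reused at
    most log(1 + n(1 + log m)) times, with per-element sensitivity
    Delta = 2(sqrt((m/2) ln(2 n log m / delta)) + sqrt(m))."""
    log_m = math.log2(m)
    uses = (1 + log_m) * math.log2(1 + n * (1 + log_m))
    Delta = 2 * (math.sqrt(m / 2 * math.log(2 * n * log_m / delta))
                 + math.sqrt(m))
    return uses * Delta / eps   # Lap(eta) noise per noisy partial sum
```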

Utility.

We first show that with high probability, the projection operators Π()\Pi_{\ell}(\cdot) play no role throughout the course of the algorithm.

No truncation happens:

For a user uu, for any 1\ell\geq 1, we have from Lemma A.1 that

Pr(|(j=21+12xj(u))(21)μ|212ln2nlogmδ)1δnlogm.\mathrm{Pr}\left(\left\lvert\left(\sum_{j=2^{\ell-1}+1}^{2^{\ell}}x_{j}^{(u)}\right)-(2^{\ell-1})\mu\right\rvert\leq\sqrt{\frac{2^{\ell-1}}{2}\ln\frac{2n\log m}{\delta}}\right)\geq 1-\frac{\delta}{n\log m}.

Since |μ~μ|1m\left\lvert\tilde{\mu}-\mu\right\rvert\leq\frac{1}{\sqrt{m}}, this gives us that, for any user uu, for any 1\ell\geq 1,

Pr(|(j=21+12xj(u))(21)μ~|212ln2nlogmδ+21m)1δn.\mathrm{Pr}\left(\left\lvert\left(\sum_{j=2^{\ell-1}+1}^{2^{\ell}}x_{j}^{(u)}\right)-(2^{\ell-1})\tilde{\mu}\right\rvert\leq\sqrt{\frac{2^{\ell-1}}{2}\ln\frac{2n\log m}{\delta}}+\frac{2^{\ell-1}}{\sqrt{m}}\right)\geq 1-\frac{\delta}{n}.

Note that a user contributes at most mm samples. Thus, at most logm\log m projection operators Π()\Pi_{\ell}(\cdot) are applied per user. Applying union bound (over \ell), we have that, for any user uu

Pr({1,,logm},|(j=21+12xj(u))(21)μ~|212ln2nlogmδ+21m)1δn.\mathrm{Pr}\left(\forall\ell\in\left\{1,\ldots,\lfloor\log m\rfloor\right\},\left\lvert\left(\sum_{j=2^{\ell-1}+1}^{2^{\ell}}x_{j}^{(u)}\right)-(2^{\ell-1})\tilde{\mu}\right\rvert\leq\sqrt{\frac{2^{\ell-1}}{2}\ln\frac{2n\log m}{\delta}}+\frac{2^{\ell-1}}{\sqrt{m}}\right)\geq 1-\frac{\delta}{n}.

Now, we take union bound over nn users, which gives us that

Pr(u[n],{1,,logm},|(j=21+12xj(u))(21)μ~|212ln2nlogmδ+21m)1δ.\mathrm{Pr}\left(\forall u\in[n],\forall\ell\in\left\{1,\ldots,\lfloor\log m\rfloor\right\},\left\lvert\left(\sum_{j=2^{\ell-1}+1}^{2^{\ell}}x_{j}^{(u)}\right)-(2^{\ell-1})\tilde{\mu}\right\rvert\leq\sqrt{\frac{2^{\ell-1}}{2}\ln\frac{2n\log m}{\delta}}+\frac{2^{\ell-1}}{\sqrt{m}}\right)\geq 1-\delta.

Note that the projection operator Π()\Pi_{\ell}(\cdot) was defined as projection on interval \mathcal{I}_{\ell} as in (4). The above equation shows that with probability at least 1δ1-\delta, the projection operators do not play any role throughout the algorithm, and thus no truncation happens.

Utility at time tt ignoring projections Π\Pi_{\ell}:

For now, consider Algorithm 4 without the projection operator $\Pi_\ell(\cdot)$ in Line 11; call it Algorithm 4-NP (NP $\equiv$ ‘No Projection’). For Algorithm 4-NP, at any time $t$, ${\rm BinMech.Stream}$ has terms of the form $\sigma_\ell^{(u)}=\left(\sum_{i=2^{\ell-1}+1}^{2^{\ell}}x_i^{(u)}\right)$ (note that $\sigma_\ell^{(u)}$ “contains information” about $2^{\ell-1}$ samples from user $u$). From Claim C.1, at time $t$, ${\rm BinMech.Stream}$ contains information about at least $t/2$ samples. Thus, $S_t$ is a sum of at least $t/2$ samples with Laplace noise added to it.

Note that, at any time tt, at most tt elements are present in BinMech.Stream{\rm BinMech.Stream}. This also means that at most tt elements are present in BinMech.NoisyPartialSums{\rm BinMech.NoisyPartialSums}. Thus, computing StS_{t} (using BinMech.Sum{\rm BinMech.Sum}) would involve at most (1+logt)(1+\log t) terms from BinMech.NoisyPartialSums{\rm BinMech.NoisyPartialSums}. Each term in BinMech.NoisyPartialSums{\rm BinMech.NoisyPartialSums} has independent Lap(η(m,n,δ)){\rm Lap}(\eta(m,n,\delta)) noise added to it, where η(m,n,δ)\eta(m,n,\delta) is as in (5).

Thus, at time tt, for Algorithm 4-NP,

St=[sum of at least t/2 Bernoulli samples]+[sum of at most (1+logt) i.i.d. Lap(η(m,n,δ)) terms]S_{t}=\left[\text{sum of at least $t/2$ Bernoulli samples}\right]+\left[\text{sum of at most $(1+\log t)$ i.i.d. ${\rm Lap}(\eta(m,n,\delta))$ terms}\right]

Hence, using Lemma A.1 and Lemma A.2, we get that at time tt,

Pr(|μ^tμ|1tln2δ+cη(m,n,δ)1+logtln1δ)12δ\mathrm{Pr}\left(\left\lvert\hat{\mu}_{t}-\mu\right\rvert\leq\sqrt{\frac{1}{t}\ln\frac{2}{\delta}}+c\eta(m,n,\delta)\sqrt{1+\log t}\ln\frac{1}{\delta}\right)\geq 1-2\delta

Final utility guarantee for Algorithm 4 at time tt:

The above utility guarantee was obtained for Algorithm 4-NP, the variant of Algorithm 4 without the projection operator $\Pi_\ell(\cdot)$ in Line 11. Now, let $\mathcal{E}_1$ be the event that no truncation happens. We already saw that no truncation happens (i.e., the projection operators play no role) with probability at least $1-\delta$. That is, $\mathrm{Pr}(\mathcal{E}_1)\geq 1-\delta$. Observe that

\mathcal{E}_{1}:=\left\{\text{no truncation happens}\right\}=\left\{\text{Algorithm 4 and Algorithm 4-NP become equivalent}\right\}.

Let

\mathcal{E}_{2}=\left\{\left\lvert\hat{\mu}_{t}-\mu\right\rvert\leq\sqrt{\frac{1}{t}\ln\frac{2}{\delta}}+c\eta(m,n,\delta)\sqrt{1+\log t}\ln\frac{1}{\delta}\text{ for Algorithm 4-NP}\right\}

We saw above that $\mathrm{Pr}(\mathcal{E}_2)\geq 1-2\delta$. Thus, by the union bound, $\mathrm{Pr}(\mathcal{E}_1\cap\mathcal{E}_2)\geq 1-3\delta$. That is, for Algorithm 4, we have, with probability at least $1-3\delta$,

|μ^tμ|\displaystyle\left\lvert\hat{\mu}_{t}-\mu\right\rvert 1tln2δ+cη(m,n,δ)t1+logtln1δ\displaystyle\leq\sqrt{\frac{1}{t}\ln\frac{2}{\delta}}+c\frac{\eta(m,n,\delta)}{t}\sqrt{1+\log t}\ln\frac{1}{\delta}
=O(1tln1δ+η(m,n,δ)tlogtln1δ)\displaystyle=O\left(\sqrt{\frac{1}{t}\ln\frac{1}{\delta}}+\frac{\eta(m,n,\delta)}{t}\sqrt{\log t}\ln\frac{1}{\delta}\right)
=O(1tln1δ+mtεlogtlogmlog(nlogm)lnnlogmδln1δ)\displaystyle=O\left(\sqrt{\frac{1}{t}\ln\frac{1}{\delta}}+\frac{\sqrt{m}}{t\varepsilon}\sqrt{\log t}\log m\log(n\log m)\sqrt{\ln\frac{n\log m}{\delta}}\ln\frac{1}{\delta}\right) (using η(m,n,δ)\eta(m,n,\delta) from (5))
=O~(1t+mtε).\displaystyle=\tilde{O}\left(\frac{1}{\sqrt{t}}+\frac{\sqrt{m}}{t\varepsilon}\right).

Appendix D Continual mean estimation assuming prior: Multiple binary mechanisms

D.1 Proof of Theorem 3.2

We will prove the following theorem.

Theorem.

Assume that we are given a user-level $\varepsilon$-DP prior estimate $\tilde{\mu}$ of the true mean $\mu$ such that $\left\lvert\tilde{\mu}-\mu\right\rvert\leq\frac{1}{\sqrt{m}}$. Then, Algorithm 5 is user-level $2\varepsilon$-DP. Moreover, for any given $t\in[T]$, we have with probability at least $1-\delta$ that

|μ^tμ|=O~(1t+Mttε),\displaystyle\left\lvert\hat{\mu}_{t}-\mu\right\rvert=\tilde{O}\left(\frac{1}{\sqrt{t}}+\frac{\sqrt{M_{t}}}{t\varepsilon}\right),

where MtM_{t} denotes the maximum number of samples obtained from any user till time tt, i.e., Mt=max{mu(t):u[n]}M_{t}=\max\left\{m_{u}(t):u\in[n]\right\}, where mu(t)m_{u}(t) is the number of samples obtained from user uu till time tt.

Proof.

We first prove the privacy guarantee and then the utility guarantee. Let $L:=\lfloor\log m\rfloor$.

Privacy.

The prior estimate μ~\tilde{\mu} is given to be user-level ε\varepsilon-DP. We will now show that, for each {0,,L}\ell\in\left\{0,\ldots,L\right\}, the array BinMech[].NoisyPartialSums{\rm BinMech}[\ell].{\rm NoisyPartialSums} is user-level εL+1\frac{\varepsilon}{L+1}-DP. By composition property (Lemma 2.2), this would mean that (BinMech[].NoisyPartialSums)=0L\Big{(}{\rm BinMech}[\ell].{\rm NoisyPartialSums}\Big{)}_{\ell=0}^{L} is user-level ε\varepsilon-DP. Note that the output estimates μ^t\hat{\mu}_{t} are computed using (BinMech[].Sum)=0L\Big{(}{\rm BinMech}[\ell].{\rm Sum}\Big{)}_{\ell=0}^{L}, which is a function of
(BinMech[].NoisyPartialSums)=0L\Big{(}{\rm BinMech}[\ell].{\rm NoisyPartialSums}\Big{)}_{\ell=0}^{L}. Thus, if (BinMech[].NoisyPartialSums)=0L\Big{(}{\rm BinMech}[\ell].{\rm NoisyPartialSums}\Big{)}_{\ell=0}^{L} is user-level ε\varepsilon-DP, and prior estimate μ~\tilde{\mu} is also user-level ε\varepsilon-DP, the overall output (μ^t)t=1T\left(\hat{\mu}_{t}\right)_{t=1}^{T} is guaranteed to be user-level 2ε2\varepsilon-DP.

Proof that BinMech[].NoisyPartialSums{\rm BinMech}[\ell].{\rm NoisyPartialSums} is user-level εL+1\frac{\varepsilon}{L+1}-DP:

Consider ${\rm BinMech}[\ell]$. A user $u$ contributes at most one element to ${\rm BinMech}[\ell].{\rm Stream}$; this element is $\Pi_\ell\left(\sum_{j=2^{\ell-1}+1}^{2^{\ell}}x_j^{(u)}\right)$, where $\Pi_\ell(\cdot)$ is the projection onto the interval $\mathcal{I}_\ell$ defined in (4). So, there are at most $n$ elements in ${\rm BinMech}[\ell].{\rm Stream}$ throughout the course of the algorithm. A given element in ${\rm BinMech}[\ell].{\rm Stream}$ is used at most $(1+\log n)$ times while computing terms in ${\rm BinMech}[\ell].{\rm NoisyPartialSums}$. Thus, changing a user can change the $\ell_1$-norm of the array ${\rm BinMech}[\ell].{\rm NoisyPartialSums}$ by at most $(1+\log n)(2\Delta_\ell)$, where $\Delta_\ell$ is as in (4). Hence, adding independent ${\rm Lap}(\eta(m,n,\ell,\delta))$ noise (with $\eta(m,n,\ell,\delta)$ as in (7)) while computing each term in ${\rm BinMech}[\ell].{\rm NoisyPartialSums}$ is sufficient to ensure that the array ${\rm BinMech}[\ell].{\rm NoisyPartialSums}$ remains user-level $\frac{\varepsilon}{L+1}$-DP throughout the course of the algorithm.

Utility.

Exactly as in the utility proof of Theorem 3.1 (Section C.2), we have that with probability at least 1δ1-\delta, the projection operators do not play any role throughout the algorithm, and thus no truncation happens.

Utility at time tt ignoring projections Π\Pi_{\ell}:

For now, consider Algorithm 5 without the projection operator Π()\Pi_{\ell}(\cdot) in Line 11. Call it Algorithm 5-NP (NP \equiv ‘No Projection’). We will derive utility at time tt for Algorithm 5-NP.

The only difference between Algorithm 5-NP and Algorithm 4-NP is that the term $\left(\sum_{j=2^{\ell-1}+1}^{2^{\ell}}x_j^{(u)}\right)$ from user $u$ is fed to ${\rm BinMech}[\ell].{\rm Stream}$ instead of a common binary mechanism stream. So, for Algorithm 5-NP, using Claim C.1, we have that, at time $t$, the combined streams $\big({\rm BinMech}[\ell].{\rm Stream}\big)_{\ell=0}^{L}$ contain information about at least $t/2$ samples. Thus, $S_t$ is a sum of at least $t/2$ samples with Laplace noise added to it.

Now, since every user has contributed at most MtM_{t} samples, it follows that BinMech[].Stream{\rm BinMech}[\ell].{\rm Stream} for >logMt\ell>\lfloor\log M_{t}\rfloor will be empty. That is, only BinMech[0],,BinMech[logMt]{\rm BinMech}[0],\ldots,{\rm BinMech}[\lfloor\log M_{t}\rfloor] will contribute to the sum StS_{t} at time tt. Moreover, for logMt\ell\leq\lfloor\log M_{t}\rfloor, since each user contributes at most one item to BinMech[].Stream{\rm BinMech}[\ell].{\rm Stream}, there can be at most nn terms in BinMech[].Stream{\rm BinMech}[\ell].{\rm Stream} and in BinMech[].NoisyPartialSums{\rm BinMech}[\ell].{\rm NoisyPartialSums}. Thus, computing BinMech[].Sum{\rm BinMech}[\ell].{\rm Sum} at time tt involves at most (1+logn)(1+\log n) terms from BinMech[].NoisyPartialSums{\rm BinMech}[\ell].{\rm NoisyPartialSums}. Each term in BinMech[].NoisyPartialSums{\rm BinMech}[\ell].{\rm NoisyPartialSums} has independent Lap(η(m,n,,δ)){\rm Lap}(\eta(m,n,\ell,\delta)) noise added to it, where η(m,n,,δ)\eta(m,n,\ell,\delta) is as in (7).

Thus, at time tt, for Algorithm 5-NP,

St=[sum of at least t/2 Bernoulli samples]+[=0logMtsum of at most (1+logn) i.i.d. Lap(η(m,n,,δ)) terms]S_{t}=\left[\text{sum of at least $t/2$ Bernoulli samples}\right]+\left[\sum_{\ell=0}^{\lfloor\log M_{t}\rfloor}\text{sum of at most $(1+\log n)$ i.i.d. ${\rm Lap}(\eta(m,n,\ell,\delta))$ terms}\right]

Hence, using Lemma A.1 and Lemma A.2, we get that at time tt,

Pr(|μ^tμ|1tln2δ+ctln1δ=0logMt(1+logn)η(m,n,,δ)2)12δ\mathrm{Pr}\left(\left\lvert\hat{\mu}_{t}-\mu\right\rvert\leq\sqrt{\frac{1}{t}\ln\frac{2}{\delta}}+\frac{c}{t}\ln\frac{1}{\delta}\sqrt{\sum_{\ell=0}^{\lfloor\log M_{t}\rfloor}(1+\log n)\eta(m,n,\ell,\delta)^{2}}\right)\geq 1-2\delta
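For concreteness, the Laplace contribution in this bound (before dividing by $t$) can be evaluated as in the following sketch (ours; the per-level scale $\eta(m,n,\ell,\delta)$ is passed in as a callable, and $c$ is the absolute constant from Lemma A.2):

```python
import math

def total_laplace_deviation(eta, L_t, n, delta, c=1.0):
    """Lemma A.2 bound on the total Laplace deviation of S_t:
    c * ln(1/delta) * sqrt(sum over l = 0..L_t of (1 + log n) * eta(l)^2),
    with at most (1 + log n) Lap(eta(l)) terms per level l."""
    var = sum((1 + math.log2(n)) * eta(l) ** 2 for l in range(L_t + 1))
    return c * math.log(1 / delta) * math.sqrt(var)
```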

Final utility guarantee for Algorithm 5 at time tt:

The above utility guarantee was obtained for Algorithm 5-NP, the variant of Algorithm 5 without the projection operator $\Pi_\ell(\cdot)$ in Line 11. As shown above, with probability at least $1-\delta$, no truncation happens. Proceeding as in the proof of Theorem 3.1 (Section C.2), we take a union bound and get that, with probability at least $1-3\delta$, Algorithm 5 satisfies

|μ^tμ|\displaystyle\left\lvert\hat{\mu}_{t}-\mu\right\rvert 1tln2δ+ctln1δ=0logMt(1+logn)η(m,n,,δ)2\displaystyle\leq\sqrt{\frac{1}{t}\ln\frac{2}{\delta}}+\frac{c}{t}\ln\frac{1}{\delta}\sqrt{\sum_{\ell=0}^{\lfloor\log M_{t}\rfloor}(1+\log n)\eta(m,n,\ell,\delta)^{2}}
=O(1tln1δ+η(m,n,logMt,δ)tlognln1δ)\displaystyle=O\left(\sqrt{\frac{1}{t}\ln\frac{1}{\delta}}+\frac{\eta(m,n,\lfloor\log M_{t}\rfloor,\delta)}{t}\sqrt{\log n}\ln\frac{1}{\delta}\right)
=O(1tln1δ+Mttε(logm)(logn)3/2lnnlogmδln1δ)\displaystyle=O\left(\sqrt{\frac{1}{t}\ln\frac{1}{\delta}}+\frac{\sqrt{M_{t}}}{t\varepsilon}(\log m)(\log n)^{3/2}\sqrt{\ln\frac{n\log m}{\delta}}\ln\frac{1}{\delta}\right) (from (7))
=O~(1t+Mttε).\displaystyle=\tilde{O}\left(\frac{1}{\sqrt{t}}+\frac{\sqrt{M_{t}}}{t\varepsilon}\right).

Appendix E Continual mean estimation: Full algorithm

E.1 Private Median Algorithm

Claim E.1.

For any 1\ell\geq 1, Algorithm 7 (PrivateMedian{\rm PrivateMedian}) is user-level ε\varepsilon-DP. Moreover, with probability at least 1δβ1-\delta-\beta,

|μ~μ|212ln2k(ε,,β)δ\left\lvert\tilde{\mu}-\mu\right\rvert\leq 2\sqrt{\frac{1}{2^{\ell}}\ln\frac{2k(\varepsilon,\ell,\beta)}{\delta}}

where k(ε,,β)=16εln2/2βk(\varepsilon,\ell,\beta)=\frac{16}{\varepsilon}\ln\frac{2^{\ell/2}}{\beta}.

Proof.

Let k:=k(ε,,β)k:=k(\varepsilon,\ell,\beta).

Privacy.

From the way the arrays $S_1,\ldots,S_k$ in Algorithm 7 are created (Lines 3-8), it follows that samples from any given user $u$ appear in at most $2$ arrays. This is because: (i) each array contains $2^{\ell-1}$ samples; (ii) each user contributes at most $2^{\ell-1}$ samples (see the definition of $r$ in Line 4); and (iii) samples from a user are added contiguously to arrays (see Lines 5-6). Now, for $j\in[k]$, since $Y_j$ is the average of samples in array $S_j$, and $Y'_j$ is a quantized version of $Y_j$, it follows that changing a user changes at most $2$ elements of $\{Y'_1,\ldots,Y'_k\}$. Thus, for any $y\in\mathcal{T}$, the cost $c(y)$ can vary by at most $2$ if a user is changed. Since the worst-case sensitivity (w.r.t. change of a user) of the cost $c$ is $\Delta c:=2$, the exponential mechanism with sampling probability proportional to $\exp\left(-\frac{\varepsilon}{2\Delta c}c(y)\right)$ is $\varepsilon$-DP ([DR+14]) w.r.t. change of a user. This proves that Algorithm 7 is user-level $\varepsilon$-DP.
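A minimal sketch of this exponential-mechanism step is given below; since Algorithm 7's exact cost function is not reproduced here, the median-style cost $c(y)$ in the sketch is our illustrative choice with the same user-level sensitivity of $2$:

```python
import numpy as np

def private_median(block_avgs, grid, eps, rng=None):
    """Exponential mechanism over the candidate grid T. block_avgs are
    the quantized block averages Y'_1, ..., Y'_k; changing one user
    moves at most two of them, so the count-based cost below has
    user-level sensitivity at most 2."""
    rng = rng or np.random.default_rng()
    vals = np.asarray(block_avgs)
    k = len(vals)
    # c(y): distance (in counts) of y from being a median of the Y'_j.
    cost = np.array([abs(np.sum(vals <= y) - k / 2) for y in grid])
    logits = -(eps / (2 * 2)) * cost    # exp(-eps c(y) / (2 Delta_c))
    probs = np.exp(logits - logits.max())
    return rng.choice(np.asarray(grid), p=probs / probs.sum())
```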

Utility.

Since, for each $j\in[k]$, $Y_j$ is the sample mean of $2^{\ell-1}$ Bernoulli random variables, we have by Lemma A.1 that

j[k],Pr(|Yjμ|12ln2kδ)1δk.\forall j\in[k],\ \mathrm{Pr}\left(\left\lvert Y_{j}-\mu\right\rvert\leq\sqrt{\frac{1}{2^{\ell}}\ln\frac{2k}{\delta}}\right)\geq 1-\frac{\delta}{k}.

Thus, by union bound,

Pr(j[k],|Yjμ|12ln2kδ)1δ.\mathrm{Pr}\left(\forall j\in[k],\ \left\lvert Y_{j}-\mu\right\rvert\leq\sqrt{\frac{1}{2^{\ell}}\ln\frac{2k}{\delta}}\right)\geq 1-\delta.

Since |Yjμ||YjYj|+|Yjμ|\left\lvert Y^{\prime}_{j}-\mu\right\rvert\leq\left\lvert Y^{\prime}_{j}-Y_{j}\right\rvert+\left\lvert Y_{j}-\mu\right\rvert, and |YjYj|2/2\left\lvert Y^{\prime}_{j}-Y_{j}\right\rvert\leq 2^{-\ell/2}, it follows that

Pr(j[k],|Yjμ|212ln2kδ)1δ.\mathrm{Pr}\left(\forall j\in[k],\ \left\lvert Y^{\prime}_{j}-\mu\right\rvert\leq 2\sqrt{\frac{1}{2^{\ell}}\ln\frac{2k}{\delta}}\right)\geq 1-\delta. (14)

Now, from Theorem 3.1 in [FS17], it follows that the exponential mechanism outputs μ~\tilde{\mu} which is a (1/4,3/4)\left(1/4,3/4\right)-quantile of Y1,,YkY^{\prime}_{1},\ldots,Y^{\prime}_{k} with probability at least 1β1-\beta. (In the statement of Theorem 3.1 in [FS17], the condition “m4ln(|T|/β)/εαm\geq 4\ln(\left\lvert T\right\rvert/\beta)/\varepsilon\alpha” becomes, in our case, k16ln(|𝒯|/β)/εk\geq 16\ln(\left\lvert\mathcal{T}\right\rvert/\beta)/\varepsilon after substituting m=km=k, α=2\alpha=2, and accounting for the fact that the cost c(y)c(y) has sensitivity 22 w.r.t. change of a user).

If $\left\lvert Y'_j-\mu\right\rvert\leq 2\sqrt{\frac{1}{2^{\ell}}\ln\frac{2k}{\delta}}$ holds for all $j\in[k]$, it must also hold for any $(1/4,3/4)$-quantile of $Y'_1,\ldots,Y'_k$. (For a dataset $s\in\mathbb{R}^n$, a $(1/4,3/4)$-quantile is any $v\in\mathbb{R}$ such that $\left\lvert\{i\in[n]:s_i\leq v\}\right\rvert>\frac{n}{4}$ and $\left\lvert\{i\in[n]:s_i<v\}\right\rvert<\frac{3n}{4}$.) Thus, from (14) and the fact that $\tilde{\mu}$ is a $(1/4,3/4)$-quantile of $Y'_1,\ldots,Y'_k$ with probability at least $1-\beta$, we get using the union bound that

Pr(|μ~μ|212ln2kδ)1δβ.\mathrm{Pr}\left(\left\lvert\tilde{\mu}-\mu\right\rvert\leq 2\sqrt{\frac{1}{2^{\ell}}\ln\frac{2k}{\delta}}\right)\geq 1-\delta-\beta.

E.2 Proof of Theorem 3.4

We will prove the following theorem.

Theorem.

Algorithm 6 for continual Bernoulli mean estimation is user-level ε\varepsilon-DP. Moreover, if at time t[T]t\in[T],

u=1nmin{mu(t),Mt2}Mt216ε(2Lln3LMtδ)(diversity condition)\sum_{u=1}^{n}\min\left\{m_{u}(t),\frac{M_{t}}{2}\right\}\geq\frac{M_{t}}{2}\frac{16}{\varepsilon}\left(2L\ln\frac{3L\sqrt{M_{t}}}{\delta}\right)\quad(\text{diversity condition})

then, with probability at least 1δ1-\delta,

|μ^tμ|=O~(1t+Mttε).\displaystyle\left\lvert\hat{\mu}_{t}-\mu\right\rvert=\tilde{O}\left(\frac{1}{\sqrt{t}}+\frac{\sqrt{M_{t}}}{t\varepsilon}\right).

Here, mu(t)m_{u}(t) is the number of samples obtained from user uu till time tt, and Mt=max{mu(t):u[n]}M_{t}=\max\left\{m_{u}(t):u\in[n]\right\}.

Proof.

Let L:=logmL:=\lceil\log m\rceil.

Privacy.

Algorithm 6 uses samples to:

  • (i) compute $\tilde{\mu}_\ell$ using ${\rm PrivateMedian}\left((x_i,u_i)_{i=1}^{t},\frac{\varepsilon}{2L},\ell,\frac{\delta}{3L}\right)$, for $\ell\in\{2,\ldots,L\}$; and

  • (ii) compute the arrays ${\rm BinMech}[\ell].{\rm NoisyPartialSums}$ for $\ell\in\{0,\ldots,L\}$.

The output $\left(\hat{\mu}_t\right)_{t=1}^{T}$ is computed using ${\rm BinMech}[\ell].{\rm Sum}$, which, in turn, is a function of the arrays ${\rm BinMech}[\ell].{\rm NoisyPartialSums}$, $\ell\in\{0,\ldots,L\}$.

From Claim E.1, we get that μ~=PrivateMedian((xi,ui)i=1t,ε2L,,δ3L)\tilde{\mu}_{\ell}={\rm PrivateMedian}\left((x_{i},u_{i})_{i=1}^{t},\frac{\varepsilon}{2L},\ell,\frac{\delta}{3L}\right) is user-level ε2L\frac{\varepsilon}{2L}-DP. Thus, from composition property of DP, we get that (μ~2,,μ~L)(\tilde{\mu}_{2},\ldots,\tilde{\mu}_{L}) is user-level ε2\frac{\varepsilon}{2}-DP.

Consider ${\rm BinMech}[\ell]$. A user $u$ contributes at most one element to ${\rm BinMech}[\ell].{\rm Stream}$; this element is $\Pi_\ell\left(\sum_{j=2^{\ell-1}+1}^{2^{\ell}}x_j^{(u)}\right)$, where $\Pi_\ell(\cdot)$ is the projection onto the interval $\mathcal{I}_\ell$ defined in (8). So, there are at most $n$ elements in ${\rm BinMech}[\ell].{\rm Stream}$ throughout the course of the algorithm. Now, a given element in ${\rm BinMech}[\ell].{\rm Stream}$ is used at most $(1+\log n)$ times while computing terms in ${\rm BinMech}[\ell].{\rm NoisyPartialSums}$. Thus, changing a user can change the $\ell_1$-norm of the array ${\rm BinMech}[\ell].{\rm NoisyPartialSums}$ by at most $(1+\log n)(2\Delta_\ell)$, where $\Delta_\ell$ is as in (9). Hence, adding independent ${\rm Lap}(\eta(m,n,\ell,\delta))$ noise (with $\eta(m,n,\ell,\delta)$ as in (10)) while computing each term in ${\rm BinMech}[\ell].{\rm NoisyPartialSums}$ is sufficient to ensure that the array ${\rm BinMech}[\ell].{\rm NoisyPartialSums}$ remains user-level $\frac{\varepsilon}{2(L+1)}$-DP throughout the course of the algorithm. Since there are $L+1$ binary mechanisms, by the composition property of DP, the overall collection $\left({\rm BinMech}[\ell].{\rm NoisyPartialSums}\right)_{\ell=0}^{L}$ is user-level $\frac{\varepsilon}{2}$-DP.

Since (μ~2,,μ~L)(\tilde{\mu}_{2},\ldots,\tilde{\mu}_{L}) is user-level ε2\frac{\varepsilon}{2}-DP, and (BinMech[].NoisyPartialSums)=0L\left({\rm BinMech}[\ell].{\rm NoisyPartialSums}\right)_{\ell=0}^{L} is user-level ε2\frac{\varepsilon}{2}-DP, we again use composition property to conclude that the output (μ^t)t=1T(\hat{\mu}_{t})_{t=1}^{T} by Algorithm 6 is user-level ε\varepsilon-DP.
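The overall budget accounting can be summarized in the following small sketch (ours):

```python
def budget_split(eps, L):
    """Algorithm 6 splits its budget as: L - 1 PrivateMedian calls at
    eps/(2L) each, plus L + 1 binary mechanisms at eps/(2(L+1)) each;
    by basic composition the total is at most eps."""
    median_eps = eps / (2 * L)           # per PrivateMedian call
    binmech_eps = eps / (2 * (L + 1))    # per BinMech level
    total = (L - 1) * median_eps + (L + 1) * binmech_eps
    assert total <= eps
    return median_eps, binmech_eps
```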

Utility.

At time $t$, $M_t$ is the maximum number of samples contributed by any user. We call ${\rm BinMech}[\ell]$ “active” at time $t$ if there are sufficiently many users and samples for $\tilde{\mu}_\ell$ to be obtained via ${\rm PrivateMedian}\left((x_i,u_i)_{i=1}^{t},\frac{\varepsilon}{2L},\ell,\frac{\delta}{3L}\right)$ (Line 10 of Algorithm 6). Recall that we need $\tilde{\mu}_\ell$ to create the truncation interval $\mathcal{I}_\ell$ (see (8)). We know that, for a given user $u$, $x_1^{(u)}$ goes to ${\rm BinMech}[0].{\rm Stream}$, $x_2^{(u)}$ goes to ${\rm BinMech}[1].{\rm Stream}$, and $\Pi_\ell\left(\sum_{j=2^{\ell-1}+1}^{2^{\ell}}x_j^{(u)}\right)$ goes to ${\rm BinMech}[\ell].{\rm Stream}$ for $\ell\geq 2$, provided ${\rm BinMech}[\ell]$ is “active”. Thus, at time $t$, since the maximum number of samples contributed by any user is $M_t$, we would like all binary mechanisms up to ${\rm BinMech}[L_t]$ to be “active”, where $L_t:=\lfloor\log M_t\rfloor$. Condition (11) guarantees that we have sufficiently many users and samples to obtain $\tilde{\mu}_2,\ldots,\tilde{\mu}_{L_t}$ via ${\rm PrivateMedian}$ (Algorithm 7), thus ensuring that every truncation required at time $t$ is indeed possible.

All μ~\tilde{\mu}_{\ell}’s are “good”:

Suppose the diversity condition (11) holds. Then, for each $\ell\in\{2,\ldots,L\}$, ${\rm PrivateMedian}\left((x_i,u_i)_{i=1}^{t},\frac{\varepsilon}{2L},\ell,\frac{\delta}{3L}\right)$ outputs a “good” $\tilde{\mu}_\ell$ satisfying

|μ~μ|212ln2k(ε2L,,δ3L)δ/3L\left\lvert\tilde{\mu}_{\ell}-\mu\right\rvert\leq 2\sqrt{\frac{1}{2^{\ell}}\ln\frac{2k(\frac{\varepsilon}{2L},\ell,\frac{\delta}{3L})}{\delta/3L}}

with probability at least 12δ3L1-\frac{2\delta}{3L} (see Claim E.1). Thus, by union bound, we have the following: with probability at least 12δ31-\frac{2\delta}{3}, μ~\tilde{\mu}_{\ell} is “good” for every {2,,L}\ell\in\left\{2,\ldots,L\right\}.

No truncation happens:

For a user uu, for any 1\ell\geq 1, we have from Lemma A.1 that

Pr(|(j=21+12xj(u))(21)μ|212ln2nlogmδ/3)1δ3nlogm.\mathrm{Pr}\left(\left\lvert\left(\sum_{j=2^{\ell-1}+1}^{2^{\ell}}x_{j}^{(u)}\right)-(2^{\ell-1})\mu\right\rvert\leq\sqrt{\frac{2^{\ell-1}}{2}\ln\frac{2n\log m}{\delta/3}}\right)\geq 1-\frac{\delta}{3n\log m}.

Taking union bound over nn users, we have that

Pr(u[n],|(j=21+12xj(u))(21)μ|212ln2nlogmδ/3)1δ3logm.\mathrm{Pr}\left(\forall u\in[n],\ \left\lvert\left(\sum_{j=2^{\ell-1}+1}^{2^{\ell}}x_{j}^{(u)}\right)-(2^{\ell-1})\mu\right\rvert\leq\sqrt{\frac{2^{\ell-1}}{2}\ln\frac{2n\log m}{\delta/3}}\right)\geq 1-\frac{\delta}{3\log m}.

Now, since all $\tilde{\mu}_\ell$'s are “good” with probability at least $1-\frac{2\delta}{3}$, we get using the union bound that

Pr(u[n],[L],|(j=21+12xj(u))(21)μ~|212ln2nlogmδ+2ln2k(ε2L,,δ3L)δ/3L)1δ.\mathrm{Pr}\left(\forall u\in[n],\forall\ell\in[L],\ \left\lvert\left(\sum_{j=2^{\ell-1}+1}^{2^{\ell}}x_{j}^{(u)}\right)-(2^{\ell-1})\tilde{\mu}_{\ell}\right\rvert\leq\sqrt{\frac{2^{\ell-1}}{2}\ln\frac{2n\log m}{\delta}}+\sqrt{2^{\ell}\ln\frac{2k(\frac{\varepsilon}{2L},\ell,\frac{\delta}{3L})}{\delta/3L}}\right)\geq 1-\delta.

Note that the projection operator Π()\Pi_{\ell}(\cdot) was defined as projection on interval \mathcal{I}_{\ell} as in (8). The above equation shows that with probability at least 1δ1-\delta, the projection operators do not play any role throughout the algorithm, and thus no truncation happens.

Utility at time tt ignoring projections Π\Pi_{\ell}:

For now, consider Algorithm 6 without the projection operator Π()\Pi_{\ell}(\cdot) in Line 22. Call it Algorithm 6-NP (NP \equiv ‘No Projection’). We will derive utility at time tt for Algorithm 6-NP.

Note that Algorithm 6-NP does not require the $\tilde{\mu}_\ell$'s. Thus, the utility of Algorithm 6-NP is the same as that of Algorithm 5-NP in the proof of Theorem 3.2 (Section D.1). Hence, at time $t$, for Algorithm 6-NP,

Pr(|μ^tμ|1tln2δ+ctln1δ=0logMt(1+logn)η(m,n,,δ)2)12δ\mathrm{Pr}\left(\left\lvert\hat{\mu}_{t}-\mu\right\rvert\leq\sqrt{\frac{1}{t}\ln\frac{2}{\delta}}+\frac{c}{t}\ln\frac{1}{\delta}\sqrt{\sum_{\ell=0}^{\lfloor\log M_{t}\rfloor}(1+\log n)\eta(m,n,\ell,\delta)^{2}}\right)\geq 1-2\delta

where η(m,n,,δ)\eta(m,n,\ell,\delta) is as in (10).

Final utility guarantee for Algorithm 6 at time tt:

The above utility guarantee was obtained for Algorithm 6-NP, the variant of Algorithm 6 without the projection operator $\Pi_\ell(\cdot)$ in Line 22. As argued above, with probability at least $1-\delta$, no truncation happens. Thus, proceeding as in the proof of Theorem 3.1 (Section C.2), we take a union bound and get that, with probability at least $1-3\delta$, Algorithm 6 satisfies

|μ^tμ|\displaystyle\left\lvert\hat{\mu}_{t}-\mu\right\rvert 1tln2δ+ctln1δ=0logMt(1+logn)η(m,n,,δ)2\displaystyle\leq\sqrt{\frac{1}{t}\ln\frac{2}{\delta}}+\frac{c}{t}\ln\frac{1}{\delta}\sqrt{\sum_{\ell=0}^{\lfloor\log M_{t}\rfloor}(1+\log n)\eta(m,n,\ell,\delta)^{2}}
=O(1tln1δ+η(m,n,logMt,δ)tlognln1δ)\displaystyle=O\left(\sqrt{\frac{1}{t}\ln\frac{1}{\delta}}+\frac{\eta(m,n,\lfloor\log M_{t}\rfloor,\delta)}{t}\sqrt{\log n}\ln\frac{1}{\delta}\right)
=O(1tln1δ+ΔlogMttε(logm)(logn)3/2ln1δ)\displaystyle=O\left(\sqrt{\frac{1}{t}\ln\frac{1}{\delta}}+\frac{\Delta_{\lfloor\log M_{t}\rfloor}}{t\varepsilon}(\log m)(\log n)^{3/2}\ln\frac{1}{\delta}\right) (from (10))
=O(1tln1δ+Mttε(logm)(logn)3/2(lnnlogmδ+ln((logm)2εδlnMtδ))ln1δ)\displaystyle=O\left(\sqrt{\frac{1}{t}\ln\frac{1}{\delta}}+\frac{\sqrt{M_{t}}}{t\varepsilon}(\log m)(\log n)^{3/2}\left(\sqrt{\ln\frac{n\log m}{\delta}}+\sqrt{\ln\left(\frac{(\log m)^{2}}{\varepsilon\delta}\ln\frac{\sqrt{M_{t}}}{\delta}\right)}\right)\ln\frac{1}{\delta}\right) (substituting for ΔlogMt\Delta_{\lfloor\log M_{t}\rfloor} from (8))
=O~(1t+Mttε).\displaystyle=\tilde{O}\left(\frac{1}{\sqrt{t}}+\frac{\sqrt{M_{t}}}{t\varepsilon}\right).