
Memory-Sample Lower Bounds for Learning with Classical-Quantum Hybrid Memory

Qipeng Liu Simons Institute for Theory of Computing. E-mail: [email protected]. Research supported in part by the Simons Institute for the Theory of Computing, through a Quantum Postdoctoral Fellowship, by the DARPA SIEVE-VESPA grant No.HR00112020023 and by the NSF QLCI program through grant number OMA-2016245.    Ran Raz Princeton University. E-mail: [email protected]. Research supported by a Simons Investigator Award and by the National Science Foundation grants No. CCF-1714779, CCF-2007462.    Wei Zhan Princeton University. E-mail: [email protected]. Research supported by a Simons Investigator Award and by the National Science Foundation grants No. CCF-1714779, CCF-2007462.
Abstract

In a work by Raz (J. ACM and FOCS 16), it was proved that any algorithm for parity learning on $n$ bits requires either $\Omega(n^{2})$ bits of classical memory or an exponential number (in $n$) of random samples. A line of recent works continued that research direction and showed that for a large collection of classical learning tasks, either super-linear classical memory size or super-polynomially many samples are needed. All these works consider learning algorithms as classical branching programs, which perform classical computation within bounded memory.

However, these results do not capture all physical computational models, most notably quantum computers and the use of quantum memory. They leave open the possibility that a small piece of quantum memory could significantly reduce the need for classical memory or samples and thus completely change the nature of the classical learning task. Despite recent research on the necessity of quantum memory for intrinsically quantum learning problems like shadow tomography and purity testing, the role of quantum memory in classical learning tasks remains obscure.

In this work, we study classical learning tasks in the presence of quantum memory. We prove that any quantum algorithm with both classical and quantum memory for parity learning on $n$ bits requires either $\Omega(n^{2})$ bits of classical memory or $\Omega(n)$ bits of quantum memory or an exponential number of samples. In other words, the memory-sample lower bound for parity learning remains qualitatively the same, even if the learning algorithm can use, in addition to the classical memory, a quantum memory of size $cn$ (for some constant $c>0$).

Our result is more general and applies to many other classical learning tasks. Following previous works, we represent by the matrix $M:A\times X\to\{-1,1\}$ the following learning task. An unknown $x$ is sampled uniformly at random from a concept class $X$, and a learning algorithm tries to uncover $x$ by seeing a stream of random samples $(a_{i},b_{i}=M(a_{i},x))$ where for every $i$, $a_{i}\in A$ is chosen uniformly at random. Assume that $k,\ell,r$ are integers such that any submatrix of $M$ of at least $2^{-k}\cdot|A|$ rows and at least $2^{-\ell}\cdot|X|$ columns has a bias of at most $2^{-r}$. We prove that any algorithm with classical-quantum hybrid memory for the learning problem corresponding to $M$ needs either (1) $\Omega(k\cdot\ell)$ bits of classical memory, or (2) $\Omega(r)$ qubits of quantum memory, or (3) $2^{\Omega(r)}$ random samples, to achieve a success probability of at least $2^{-O(r)}$.

Our results refute the possibility that a small amount of quantum memory significantly reduces the size of classical memory needed for efficient learning on these problems. Our results also imply improved security of several existing cryptographical protocols in the bounded-storage model (protocols that are based on parity learning on $n$ bits), proving that security holds even in the presence of a quantum adversary with at most $cn^{2}$ bits of classical memory and $cn$ bits of quantum memory (for some constant $c>0$).

1 Introduction

Memory plays an important role in learning. Starting from the seminal works by Shamir [Sha14] and Steinhardt, Valiant and Wager [SVW16], a sequence of works initiated and deepened the study of lower bounds for learning under memory constraints. Steinhardt, Valiant, and Wager [SVW16] conjectured that in order to learn an unknown $n$-bit string from samples of random-subset parity, an algorithm needs either memory size quadratic in $n$ or exponentially many random samples (also in $n$). This conjecture was later proved by Raz [Raz18], showing for the first time that for some learning problems, super-linear memory size is required for efficient learning. This result was then generalized to a broad class of learning problems [KRT17, Raz17, MM18, BGY18, GRT18, SSV19, GKR20, GKLR21].

Although we have a comprehensive understanding of the (in)feasibility of learning under limitations on a particular computational resource (memory), the previous works mentioned above do not capture all physical computational models; most notably, quantum computation and the power of quantum memory. Many researchers believe that large-scale quantum computers will eventually become viable. Recent experiments demonstrated quantum advantages, for example [AAB+19], and suggested that there are possibly no fundamental barriers to achieving quantum memory and quantum computers. Questions on the role of quantum memory in learning were proposed by Wright in the context of general state tomography [Wri16] and by Aaronson for shadow tomography [Aar18]. A line of works [HHJ+16, BCL20, HKP20, CCHL22, ACQ22, CLO22] pioneered this direction and showed either polynomial or exponential separations for learning with/without quantum memory, but all for intrinsically quantum learning tasks like state tomography, shadow tomography and purity testing.

In light of the above, it is appealing to consider classical learning tasks in the presence of quantum memory, as well as hybrid classical-quantum memory. A direct implication of all aforementioned classical results only gives trivial results. As $k$ qubits of memory can always be efficiently simulated by $\sim 2^{k}$ classical bits, we can only conclude (say, for parity learning) that either $\sim 2\log n$ qubits of quantum memory or exponentially many samples are needed. Prior to our work, it could have been the case that even if only a very small quantum memory was available, it might have significantly reduced the need for classical memory and led to an efficient learning algorithm.

In this work, we prove memory-sample lower bounds in the presence of hybrid memory for a wide collection of classical learning problems. As in [Raz17, GRT18], we will represent a learning problem by a matrix $M:A\times X\to\{-1,1\}$ whose columns correspond to concepts in the concept class $X$ and rows correspond to random samples. In the learning task, an unknown concept $x\in X$ is sampled uniformly at random and each random sample is given as $(a_{i},b_{i})=(a_{i},M(a_{i},x))$ for a uniformly picked $a_{i}\in A$. The learner's goal is to uncover $x$. In [GRT18], it is proved that when the underlying matrix $M$ is a $(k,\ell)$-$L_{2}$ two source extractor (roughly speaking, this means that every submatrix $M^{\prime}$ of $M$ with at least $2^{-k}\cdot|A|$ rows and at least $2^{-\ell}\cdot|X|$ columns has a relative bias of at most $2^{-r}$) with error $2^{-r}$, a learning algorithm requires either $\Omega(k\cdot\ell)$ bits of memory or $2^{\Omega(r)}$ samples to achieve a success probability of at least $2^{-O(r)}$ for the learning task.
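To make the matrix view concrete, here is a minimal Python sketch (illustrative only; the parameters and helper names are ours, not the paper's) of parity learning in this framework: the matrix is $M(a,x)=(-1)^{\langle a,x\rangle}$, a sample is $(a,M(a,x))$ for a uniformly random $a$, and the extractor condition says that every large enough submatrix has small bias.

```python
import numpy as np

# A minimal sketch (not from the paper) illustrating the matrix view of parity
# learning: M(a, x) = (-1)^{<a, x>}, samples are (a, M(a, x)) for random a.
rng = np.random.default_rng(0)
n = 8
x = rng.integers(0, 2, n)              # unknown concept, uniform in {0,1}^n

def sample():
    a = rng.integers(0, 2, n)          # uniformly random row index a
    b = (-1) ** int(np.dot(a, x) % 2)  # label b = M(a, x) in {-1, 1}
    return a, b

# Bias of a submatrix: rows A' x columns X', estimated by averaging M(a, x').
def bias(rows, cols):
    vals = [(-1) ** int(np.dot(a, xp) % 2) for a in rows for xp in cols]
    return abs(np.mean(vals))

rows = [rng.integers(0, 2, n) for _ in range(64)]
cols = [rng.integers(0, 2, n) for _ in range(64)]
print(sample(), bias(rows, cols))      # random submatrices typically have tiny bias
```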

1.1 Our Results

In this work, we model a quantum learning algorithm as a program with hybrid memory consisting of $q$ qubits of quantum memory and $m$ bits of classical memory. At each stage, a random sample $(a_{i},b_{i}=M(a_{i},x))$ is given to the algorithm. The quantum learning algorithm applies an arbitrary quantum channel to the hybrid memory, controlled by the random sample. Although the channel can be arbitrary, we require the outcome to be a hybrid classical-quantum state of at most $q$ qubits and $m$ bits. We stress that there is no limitation on the complexity of the quantum channel (this only makes our results stronger, as we are proving lower bounds for such algorithms).

With the above model, we give the following main theorem.

Theorem 1 (Main Theorem, Informal).

Let $M:A\times X\to\{-1,1\}$ be a matrix. If $M$ is a $(k,\ell)$-$L_{2}$ two source extractor with error $2^{-r}$, a quantum learning algorithm requires either

  1. $\Omega(k\cdot\ell)$ bits of classical memory; or,

  2. $\Omega(r)$ qubits of quantum memory; or,

  3. $2^{\Omega(r)}$ samples,

to succeed with a probability of at least $2^{-O(r)}$ in the corresponding learning task.

Our main theorem implies that for many learning problems, the availability of a quantum memory of size up to $\Omega(r)$ does not reduce the size of classical memory or the number of samples that are needed. As coherent quantum memory is challenging for near-term intermediate-scale quantum computers and is likely to remain expensive even if and when quantum computers become widely viable, the impact of quantum memory is further limited for these learning problems.

To make the theorem more precise, let us take parity learning as an example. The above theorem says that a quantum learning algorithm needs either $\Omega(n^{2})$ bits of classical memory or $\Omega(n)$ qubits of quantum memory to conduct efficient learning; otherwise, it requires $2^{\Omega(n)}$ random samples. At first glance, it seems that the constraint on quantum memory is trivial: if the target is to learn an $n$-bit unknown secret, a linear amount of memory always seems necessary to store the secret. However, our main theorem applies to quantum learning algorithms with hybrid memory and rules out algorithms with $n^{2}/1000$ bits and $n/1000$ qubits of hybrid memory for parity learning, so it yields non-trivial and compelling memory-sample lower bounds. Note also that our results (and previous results) are valid even if the goal is to output only one bit of the secret. Currently, we do not know whether our main theorem is tight. For parity learning, we are not aware of any quantum learning algorithm that uses only $O(n)$ qubits of quantum memory. We leave closing the gap as a fascinating open question.

The main theorem naturally applies to other learning problems considered in [GRT18], including learning sparse parities, learning from sparse linear equations, and many others. We do not present an exhaustive list here but refer the readers to [GRT18] for more details.

Along the way, we propose a new approach for proving the classical memory-sample lower bounds, which we call the "badness levels" method. The approach is technically equivalent to the previous approach in [Raz17, GRT18] but is conceptually simpler to work with, and we are able to lift it to the quantum case.

We note that proving a linear lower bound on the size of the quantum memory, without classical memory, is significantly simpler (but to the best of our knowledge such a proof has not appeared prior to our work). We state and prove such a result as Theorem 3 in Appendix C. While Theorem 3 is qualitatively weaker than our main result in most cases, as it only gives a lower bound for programs with quantum memory but without a (possibly quadratic) classical memory, it is technically incomparable and is stated in terms of quantum extractors rather than classical extractors. Additionally, the proof of Theorem 3 is significantly simpler than the proof of our main theorem.

Implications to Cryptography in the Bounded-Storage Model.

Since learning theory and cryptography can be viewed as two sides of the same coin, our theorem also lifts the security of many existing cryptographical protocols in the bounded-storage model (protocols that are based on parity learning) to the quantum setting. To the best of our knowledge, these are the first proofs that classical cryptographical protocols are secure against space-bounded quantum attackers. (On the other hand, there are known examples of classically-secure bounded-storage protocols that are breakable with an exponentially smaller amount of quantum memory [GKK+08].) We elaborate more below.

Cryptography in the (classical) bounded storage model was first proposed by Maurer [Mau92]. In such a model, no computational assumption is needed. Honest execution is performed through a long streaming sequence of bits. Eavesdroppers have bounded storage and only a limited capability of storing the conversation, and thus cannot break the protocol. A line of works [CM97, AR99, ADR02, DR02, DM02, Lu02, DM04, MSTS04, HN06, DQW21, DQW22, …] builds efficient and secure protocols for key agreement, oblivious transfer, bit commitment and time stamping in that model.

Based on the memory-sample lower bounds for parity learning of $n$ bits, [Raz18] suggested an encryption scheme in the bounded-storage model. Guan and Zhandry [GZ19] proposed key agreement, oblivious transfer and bit commitment with improved rounds and better correctness, against attackers with up to $O(n^{2})$ bits of memory. Following a similar idea, Liu and Vusirikala [LV21] showed that semi-honest multiparty computation could be achieved against attackers with up to $O(n^{2})$ bits of memory. More recently, Dodis, Quach, and Wichs [DQW22] considered message authentication in the bounded storage model based on parity learning. Our result on parity learning gives a direct lift to all the results above. When the cryptographic protocols are based on parity learning of $n$ bits (often treated as a security parameter), our result shows that security holds even in the presence of a quantum adversary with at most $O(n^{2})$ bits of classical memory and $O(n)$ qubits of quantum memory.

While there are many previous works on cryptography in the quantum bounded storage model [DFR+07, DFSS07, Sch07, DFSS08, WW08, PMLA13, BY21], they all rely on streaming quantum states. Our memory-sample lower bounds give, for the first time, a rich class of classical cryptographical schemes (key agreement, oblivious transfer, and bit commitment) secure against space-bounded quantum attackers.

2 Proof Overview

2.1 Recap of Proofs for Classical Lower Bounds

Since our proof builds on the previous line of works on classical memory-sample lower bounds for learning, specifically on the proof technique of [Raz17, GRT18], we provide a brief review of these proofs, using parity learning [Raz18] as an example. Below, $M(a,x)$ denotes the inner product of $a$ and $x$ over $\mathbb{F}_{2}$.

Consider a classical branching program that tries to learn an unknown and uniformly random $x\in\{0,1\}^{n}$ from samples $(a,b)$, where $a\in\{0,1\}^{n}$ is uniformly random and $b=M(a,x)$. We can associate every state $v$ of the branching program with a distribution $P_{X|v}$ over $\{0,1\}^{n}$, indicating the distribution of $x$ conditioned on reaching that state. At the initial state, without any information about $x$, the distribution is uniform (which has the smallest possible $\ell_{2}$-norm). Along a computational path on the branching program, the distribution $P_{X|v}$ evolves and eventually gets concentrated (with large $\ell_{2}$-norm) in order to output $x$ correctly. Therefore, during the evolution, $P_{X|v}$ should at some stage have a mildly large $\ell_{2}$-norm ($2^{\varepsilon n}$ times larger than uniform for some small constant $\varepsilon>0$). If we set such a distribution as a target, the distribution is hard to achieve with random samples. Only with probability $2^{-\Omega(n)}$ can the branching program make significant progress towards the target; most of the time a sample just splits the distributions (both the current and the target distribution) into two even parts, and that does not help much in getting closer to the target distribution (with large $\ell_{2}$-norm).

To put it more rigorously, we examine the evolution of the inner product

\langle P_{X|v},P\rangle=\sum_{x\in\{0,1\}^{n}}P_{X|v}(x)\cdot P(x)

between the distribution $P_{X|v}$ on the current state $v$, and a target distribution $P$. Receiving a sample $(a,b)$ implies that $M(a,x)=b$, hence only the part of $P_{X|v}$ supported on such $x$ proceeds. If this part has probability close to $\frac{1}{2}$, we say that $a$ divides $P_{X|v}$ evenly. Denoting the new distribution as $P_{X|v}^{(a,b)}$, after proper normalization the new inner product is

\langle P_{X|v}^{(a,b)},P\rangle=\sum_{\begin{subarray}{c}x\in\{0,1\}^{n}\\ M(a,x)=b\end{subarray}}P_{X|v}(x)\cdot P(x)\bigg{/}\sum_{\begin{subarray}{c}x\in\{0,1\}^{n}\\ M(a,x)=b\end{subarray}}P_{X|v}(x). (1)

Ideally, both $P_{X|v}$ and the point-wise product vector $P_{X|v}\cdot P$ should have reasonably small $\ell_{2}$-norms. Due to the extractor property of $M$, most $a\in\{0,1\}^{n}$ should divide both vectors evenly, and thus the denominator is close to $\frac{1}{2}$ while the numerator is close to $\frac{1}{2}\langle P_{X|v},P\rangle$. That means, given a uniformly random $a$, we get limited progress on the inner product. On the other hand, to get from $\langle U,P\rangle=2^{-n}$ with the uniform distribution $U$ to $\langle P,P\rangle=2^{2\varepsilon n}\cdot 2^{-n}$, the branching program needs to make multiple steps of progress. Therefore this happens with an extremely small probability.
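The following hedged numerical sketch (small $n$, illustrative parameters of our choosing) makes the update in Eq. 1 concrete: it conditions the current distribution on a sample $(a,b)$ and renormalizes, and one can observe that a random $a$ typically splits the mass nearly in half, so the inner product barely moves.

```python
import numpy as np

# Sketch of the inner-product update in Eq. (1) for parity learning, small n:
# condition P_{X|v} on a sample (a, b), renormalize, and compare <P^{(a,b)}, P>
# with <P_{X|v}, P>.
rng = np.random.default_rng(1)
n = 10
X = np.array([[(i >> j) & 1 for j in range(n)] for i in range(2 ** n)])

P_xv = rng.random(2 ** n); P_xv /= P_xv.sum()   # current distribution on x
P = rng.random(2 ** n); P /= P.sum()            # target distribution

a = rng.integers(0, 2, n)
x_true = X[rng.integers(2 ** n)]
b = int(np.dot(a, x_true) % 2)                  # sample (a, b) with b = <a, x>

mask = (X @ a) % 2 == b                         # the x consistent with (a, b)
numer = np.sum(P_xv[mask] * P[mask])
denom = np.sum(P_xv[mask])
print("old:", np.dot(P_xv, P), "new:", numer / denom, "split:", denom)
# Typically denom is close to 1/2 and the inner product barely moves.
```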

To ensure that the above argument goes through, we require the following properties for every state $v$ in the branching program:

  • The $\ell_{2}$-norm $\big{\|}P_{X|v}\big{\|}_{2}$ is small.

  • The $\ell_{2}$-norm $\big{\|}P_{X|v}\cdot P\big{\|}_{2}$ is small, which is implied when the $\ell_{\infty}$-norm $\big{\|}P_{X|v}\big{\|}_{\infty}$ is small.

  • The denominator in Eq. 1 is bounded away from 0 for every sample $(a,b)$.

These properties do not hold by themselves. Instead, we execute a truncation procedure on the branching program before choosing a target distribution. More specifically, the branching program is modified so that it stops whenever it:

  • ($\ell_{2}$ truncation): Reaches a state $v$ with large $\big{\|}P_{X|v}\big{\|}_{2}$;

  • ($\ell_{\infty}$ truncation): Reaches a state $v$ with large $P_{X|v}(x)$ when the unknown concept is $x$;

  • (Sample truncation): Or, for the next sample $(a,b)$, $a$ does not divide $P_{X|v}$ evenly.

It turns out that after the $\ell_{2}$ truncation, the other two truncation steps add $2^{-\Omega(n)}$ error in each stage of the branching program. Therefore the proof boils down to proving a $2^{-\Omega(n^{2})}$ bound on the probability of reaching a state with large $\big{\|}P_{X|v}\big{\|}_{2}$, from which, by a standard union bound, we can prove the memory-sample lower bounds for parity learning: either $2^{\Omega(n)}$ samples or $\Omega(n^{2})$ bits of memory are necessary.

2.2 Badness Levels

As mentioned above, to bound the probability of reaching a state with a large $\ell_{2}$-norm, the basic idea is to fix its distribution as the target distribution $P$, and bound the increase of the inner product $\langle P_{X|v},P\rangle$. This was done in [Raz17, GRT18] by designing a potential function that tracks the average of $\langle P_{X|v},P\rangle^{k}$ for some $k=\Theta(n)$, where the average is over states $v$ in the same stage of the branching program. Here we propose another approach using the concept of badness levels. Although it is technically equivalent to the potential function approach in the classical case, it is more pliable and easier to adapt to the quantum case. We view this approach as a separate contribution of our work.

We first define a bad event to be a pair $(v,a)$ of the state $v$ and the upcoming part of the sample $a$, such that $\langle P_{X|v},P\rangle\geq 2^{-n}$, and for one of the two possible outcomes $b$,

\sum_{\begin{subarray}{c}x\in\{0,1\}^{n}\\ M(a,x)=b\end{subarray}}P_{X|v}(x)\cdot P(x)\geq\left(\frac{1}{2}+2^{-\delta n}\right)\cdot\langle P_{X|v},P\rangle (2)

with some small constant $\delta$. In other words, the inner product $\langle P_{X|v},P\rangle$ is large enough, while not being divided evenly by $a$. From Eq. 1 we know that the inner product gets at most roughly doubled through a bad event. In contrast, in the good case, the inner product either gains a mere $(1+2^{-\delta n})$ multiplicative factor or is already smaller than the baseline $2^{-n}$. Also, the extractor property of $M$ ensures that for every state $v$, over a uniformly random $a$, the bad event happens with probability at most $2^{-\Omega(n)}$.

Now, the badness level $\beta(v)$ of a state $v$ keeps track of how many times the computational path went through bad events before reaching $v$. (For now we think of $\beta(v)$ as a natural number. In the actual proof, $\beta(v)$ is a distribution on natural numbers, as for different computational paths reaching the same state, the count of bad events can be different.) The above observations on the bad events imply that (omitting the smaller factors):

  • For every state $v$, $\langle P_{X|v},P\rangle$ is bounded by $2^{\beta(v)}\cdot 2^{-n}$;

  • Heading to the next stage, $\beta(v)$ increases by $1$ with probability $2^{-\Omega(n)}$.

Therefore at each stage, the total weight of states with badness level $\beta$ is at most $2^{-\Omega(\beta n)}$. Thus any state with $\langle P_{X|v},P\rangle\geq 2^{2\varepsilon n}\cdot 2^{-n}$ must have probability $2^{-\Omega(n^{2})}$.
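As a sanity check of this counting (not part of the proof, with illustrative parameters of our choosing), one can simulate the badness level as a counter that increments with probability $p=2^{-cn}$ at each of $T$ stages; the mass at level $\beta$ then decays roughly like $(Tp)^{\beta}$, which is $2^{-\Omega(\beta n)}$ when $T\ll 2^{cn}$.

```python
import numpy as np

# Illustrative sketch: badness level as a counter that increments with
# probability p at each stage; the mass at level beta decays like (T*p)^beta.
rng = np.random.default_rng(2)
n, c, T, trials = 20, 0.5, 200, 20000
p = 2.0 ** (-c * n)

beta = np.sum(rng.random((trials, T)) < p, axis=1)   # final badness levels
for level in range(3):
    print(level, np.mean(beta >= level), "vs bound ~", (T * p) ** level)
# Reaching level beta ~ n would require probability roughly 2^{-Omega(n^2)}.
```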

2.3 Obstacles for Proving Quantum Lower Bounds

In this section, we present an attempt to prove the same $2^{\Omega(n)}$-sample or $\Omega(n^{2})$-quantum-memory lower bound for the purely quantum case. Along the way we identify some obstacles to proving memory-sample lower bounds for quantum learning algorithms, and in the next section we show how to overcome these obstacles while proving lower bounds for hybrid learning algorithms, with quadratic-size classical memory and linear-size quantum memory.

Following the same framework as the above described proof for the classical case, we first need to transfer all the notions to quantum algorithms:

  • The state $v$ is a quantum state in the Hilbert space of quantum memory;

  • The distribution $P_{X|v}$ is still well-defined: it is the distribution of $x$ when the quantum memory is measured to $v$ (see Section 3.4 and Eq. 3);

  • We are still able to implement $\ell_{2}$ truncation: if $P_{X|v}$ has large $\ell_{2}$-norm, project the entire system to the orthogonal subspace $v^{\perp}$ of $v$ and repeat, until there is no such state $v$ (see Section 4.1 for details).

  • We are also able to implement sample truncation, in a similar manner to $\ell_{2}$ truncation. As the criterion here depends on $a$, we separately create a copy of the current system for each $a$, truncate the states $v$ using projection when $P_{X|v}$ is not evenly divided by $a$ in each copy, and then merge them back together. We prove that the error introduced by this truncation is small.

Here comes the first major obstacle: $\ell_{\infty}$ truncation. In the classical case, $\ell_{\infty}$ truncation is implemented for each individual $x$, in contrast to $\ell_{2}$ truncation where the states are removed altogether. Relying on the fact that the $\ell_{2}$-norm of the distribution is already known to be small, one can use the Markov inequality to prove that the error introduced by the $\ell_{\infty}$ truncation is small.

However, when we try to emulate the classical implementation of $\ell_{\infty}$ truncation with quantum truncation, that is, to project the system to $v^{\perp}$ only conditioned on the specific $x$ where $P_{X|v}(x)$ is large, instead of for every $x$, it may lead to huge changes to the distributions $P_{X|u}$ on states $u$ non-orthogonal to $v$. The following example illustrates such a scenario:

Example.

Consider a quantum learning algorithm, and assume that at some stage of the computation, for each $x\in\{0,1\}^{n}$, the quantum memory is in some pure state $v(x)$. We pick each $v(x)$ uniformly at random in a Hilbert space of dimension $d\approx 2^{n/2}$ and consider a typical configuration of $v(x)$. Now the $\ell_{2}$-norms are bounded for every quantum state $v$: the worst ones happen when $v=v(x)$ for some $x$, where $\|P_{X|v(x)}\|_{2}$ is typically around $d\cdot 2^{-n}$, close to the $\ell_{2}$-norm of the uniform distribution. However, those worst distributions also have $\ell_{\infty}$-norms close to $d\cdot 2^{-n}$, which is much larger than the $\ell_{\infty}$-norm of the uniform distribution, and needs to be truncated. But truncating $v(x)$ off for $x$ means that $x$ is completely erased, and we end up removing everything.
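A hedged numerical sketch of this example (with a small $n$ for tractability, and Haar-random states generated as normalized Gaussian vectors) shows both effects: the conditional distribution $P_{X|v(x_{0})}$ has $\ell_{2}$-norm around $d\cdot 2^{-n}$, yet its $\ell_{\infty}$-norm is also around $d\cdot 2^{-n}$ and the mass sits on $x_{0}$ itself.

```python
import numpy as np

# Each concept x gets a random pure state v(x) in dimension d ~ 2^{n/2}.
# Conditioned on measuring the memory in state v(x0), the distribution on x is
# P(x) proportional to |<v(x0), v(x)>|^2, which is spiky at x = x0.
rng = np.random.default_rng(3)
n, d = 12, 2 ** 6                     # d ~ 2^{n/2}
N = 2 ** n

V = rng.normal(size=(N, d)) + 1j * rng.normal(size=(N, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # rows are the states v(x)

x0 = 0
overlaps = np.abs(V @ V[x0].conj()) ** 2        # |<v(x0), v(x)>|^2 for all x
P = overlaps / overlaps.sum()

print("l2 norm:", np.linalg.norm(P), "vs d*2^-n =", d / N)
print("linf norm:", P.max(), "vs uniform 2^-n =", 1 / N)
# Both are around d*2^{-n}: the l2 norm is fine, but the linf norm is d times
# larger than uniform, and the mass sits exactly on x0.
```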

Moving on, we fix a target state $v$ with a target distribution $P_{X|v}$ that exceeds the $\ell_{2}$-norm threshold, and the goal is again to prove a $2^{-\Omega(n^{2})}$ amplitude bound on $v$. The bad event should still be defined as a pair $(v,a)$ satisfying Eq. 2, with $v$ now being a quantum state. We then run into the second major obstacle: it is not clear how to define badness levels.

If we define the badness level $\beta(v)$ for each state $v$ individually by examining the bad events over the historical states, then it is not clear how to measure the total weight of a badness level $\beta$. In the classical case, we simply define the total weight as the total probability of states with badness level $\beta$. But here in the quantum case, it turns out that such a definition either depends on the choice of basis, in which case it might have a large increment in each stage, or completely fails to imply the desired amplitude bound on the target state.

The other choice is to have a more operational definition of badness levels, and it is indeed tempting to define $\beta$ as another register whose updates are controlled by the quantum memory. The problem with such definitions is that the bad event (Eq. 2) is not linear in $v$. Therefore an operational definition of badness level, which is a linear operator, inevitably introduces error that escalates quickly with the number of stages.

2.4 Hybrid Memory Lower Bounds with Small Quantum Memory

The obstacles in the previous section arise when proving a quadratic quantum-memory lower bound. We note that proving a linear quantum-memory lower bound (without classical memory) is not hard: the proof can be entirely information-theoretic, as with very limited memory, say $\frac{1}{2}n$ qubits, the information gained from each sample is exponentially small, despite the memory being quantum. We present such a proof in Appendix C.

The lower bounds that we prove here are for hybrid memory: to learn parity with both classical and quantum memory, an algorithm needs either $2^{\Omega(n)}$ samples, or $\Omega(n^{2})$ bits of classical memory, or $\Omega(n)$ qubits of quantum memory (Theorem 1). We now describe how we overcome the previously mentioned obstacles.

\ell_{\infty} Truncation.

When there is only a small quantum memory and no classical memory, the treatment of $\ell_{\infty}$ truncation is straightforward. We remove all quantum states $v$ with distributions of large $\ell_{\infty}$-norm, by projecting the system to the orthogonal subspace $v^{\perp}$, just like in the $\ell_{2}$ truncation. As the overall distribution on $x$ is uniform, any state $v$ with $\|P_{X|v}\|_{\infty}\geq 2^{\delta n}\cdot 2^{-n}$ must have weight at most $2^{-\delta n}$. Therefore, as long as the dimension of the Hilbert space is much smaller than $\delta n$, the error introduced in this truncation is small. (The example in the previous section, which shows the infeasibility of treating $\ell_{\infty}$ truncation the same way as $\ell_{2}$ truncation, does not apply here, as it requires $n/2$ qubits of memory while here we have a smaller memory size.)

With classical memory present, the actual $\ell_{\infty}$ truncation step (see Section 4.2, Step 2) is more complicated. We first apply the original classical $\ell_{\infty}$ truncation on the classical memory $W$. Now that $\|P_{X|w}\|_{\infty}$ is bounded for each classical memory state $w$, we can remove the quantum states $v$ with large $\|P_{X|v,w}\|_{\infty}$ by projection as stated above. Since the classical $\ell_{\infty}$ truncation depends on $x$, it could change the distributions $P_{X|v,w}$. However, as in the classical case, $P_{X|w}$ will not change a lot. Thus, wherever $P_{X|v,w}$ changes drastically, it must have small weight and can also be removed by projection. This removal corresponds to truncation by $G_{t}$ in Section 4.2.

Badness Levels.

Interestingly, we are able to avoid the problems of defining the badness level on quantum memory altogether, by keeping it a property of the classical memory only. To do so we need to alter the definition of a bad event: it is now a pair $(w,a)$ of a classical memory state $w$ and a sample $a$, such that there exists some quantum memory state $v$ with $P_{X|v,w}$ satisfying Eq. 2.

For each fixed classical memory state $w$, we still need to ensure that bad events happen with small probability. We prove this (Lemma 5.2) by showing that, if there are many different samples $a$, each associated with some quantum state $v_{a}$ satisfying Eq. 2, then there is some quantum state $v$ that simultaneously satisfies Eq. 2 with most of such $a$ (which is impossible because of the extractor property). This is ultimately due to the continuous nature of Eq. 2: under a proper congruent transformation, Eq. 2 becomes a simple threshold inequality on quadratic forms over $v$. Now if it is satisfied by some $v_{a}$, it is going to be satisfied by most $v$ for a much smaller threshold parameter $\delta$, and hence the existence of a simultaneously satisfying $v$. (We note that the error bound for sample truncation (Lemma 4.12) is also proved using this argument.) In this argument, we use Lemma 3.1, which is derived from the anti-concentration bound for Gaussian quadratic forms, and crucially relies on the fact that the dimension is at most $2^{\varepsilon n}$ for some small $\varepsilon$.

Another technical problem is that in order to use the extractor property, we need to ensure that $\langle P_{X|v,w},P\rangle\geq 2^{-n}$ for the simultaneously satisfying $v$. Thus, what we do in Lemma 5.2 is to first conceptually remove the parts where $\langle P_{X|v,w},P\rangle$ is too small, using projections similar to the truncation steps. After the removal, we are left with a subspace $\mathcal{V}^{\prime}$ where $\langle P_{X|v,w},P\rangle$ is always lower bounded, and we show that for every state $v$ that satisfies Eq. 2, the inequality is still close to being satisfied after projecting $v$ onto $\mathcal{V}^{\prime}$. Therefore we can still apply the above argument and find a simultaneously satisfying $v$ within the subspace.

3 Preliminaries

3.1 Vectors and Matrices

For a vector $v\in\mathbb{C}^{d}$ and $p\in[1,\infty]$, we define the $\ell_{p}$ norm of $v$ as

\|v\|_{p}=\left(\sum_{i=1}^{d}|v_{i}|^{p}\right)^{1/p}.

For two vectors $u,v\in\mathbb{C}^{d}$, define their inner product as $\langle u,v\rangle=u^{\dagger}v=\sum_{i=1}^{d}\overline{u_{i}}v_{i}$, so that $\|v\|_{2}^{2}=\langle v,v\rangle$. We also view every distribution $P$ over a set $\mathcal{X}$ as a non-negative real vector with $\|P\|_{1}=1$.

We specifically use Dirac notation to denote unit vectors: $|v\rangle\in\mathbb{C}^{d}$ implies that $\||v\rangle\|_{2}=1$. For a non-zero vector $u\in\mathbb{C}^{d}$, let $|v\rangle\sim u$ denote the normalization of $u$, that is, $|v\rangle=u/\|u\|_{2}$.

For every vector $v\in\mathbb{C}^{d}$, let $\operatorname{Diag}v\in\mathbb{C}^{d\times d}$ be the diagonal matrix whose diagonal entries represent $v$. Conversely, for every square matrix $M$, let $\operatorname{diag}M$ be the vector consisting of the diagonal entries of $M$. For a matrix (or generally a linear operator) $M$, we use $\|M\|_{\mathrm{Tr}}$ and $\|M\|_{2}$ to denote its trace norm and spectral norm respectively, that is,

\big{\|}M\big{\|}_{\mathrm{Tr}}=\mathrm{Tr}\left[\sqrt{MM^{\dagger}}\right],\quad\big{\|}M\big{\|}_{2}=\max_{v\neq 0}\|Mv\|_{2}/\|v\|_{2}.
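As a quick illustration (not from the paper), the trace norm equals the sum of the singular values of $M$ and the spectral norm equals the largest one; a short numpy check:

```python
import numpy as np

# The trace norm is the sum of singular values (Tr[sqrt(M M^dagger)]) and the
# spectral norm is the largest singular value, matching the definitions above.
rng = np.random.default_rng(4)
M = rng.normal(size=(5, 5)) + 1j * rng.normal(size=(5, 5))

s = np.linalg.svd(M, compute_uv=False)            # singular values of M
trace_norm, spectral_norm = s.sum(), s.max()

# Cross-check Tr[sqrt(M M^dagger)] via the eigenvalues of M M^dagger.
eig = np.linalg.eigvalsh(M @ M.conj().T)
assert np.isclose(trace_norm, np.sqrt(np.clip(eig, 0, None)).sum())
print(trace_norm, spectral_norm)
```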

For a Hermitian $M\in\mathbb{C}^{d\times d}$, we say it is a positive semi-definite operator if for every $v\in\mathbb{C}^{d}$, $v^{\dagger}Mv\geq 0$. A (partial) density operator is a positive semi-definite operator with trace equal to $1$ (or at most $1$, respectively).

Viewing a Learning Problem as a Matrix

Let $M:\mathcal{A}\times\mathcal{X}\rightarrow\{-1,1\}$ be a matrix. The matrix $M$ corresponds to the following learning problem. There is an unknown element $x\in\mathcal{X}$ that was chosen uniformly at random. A learner tries to learn $x$ from samples $(a,b)$, where $a\in\mathcal{A}$ is chosen uniformly at random and $b=M(a,x)$. That is, the learning algorithm is given a stream of samples $(a_{1},b_{1}),(a_{2},b_{2}),\ldots$, where each $a_{t}$ is uniformly distributed and for every $t$, $b_{t}=M(a_{t},x)$. For each $a\in\mathcal{A}$, we use $M_{a}:\mathcal{X}\rightarrow\{-1,1\}$ to denote the vector corresponding to the $a$-th row of $M$.

Extractors

A matrix $M:\mathcal{A}\times\mathcal{X}\rightarrow\{-1,1\}$ with $n=\log_{2}|\mathcal{X}|$ is a $(k,\ell)$-$L_{2}$ extractor with error $2^{-r}$, if for every distribution $P$ over $\mathcal{X}$ with $\|P\|_{2}\leq 2^{\ell}\cdot 2^{-n/2}$, there are at most $2^{-k}\cdot|\mathcal{A}|$ rows $a\in\mathcal{A}$ such that

|\langle M_{a},P\rangle|\geq 2^{-r}.
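The following hedged sketch illustrates the definition on the parity matrix for a small $n$ (the subset construction and thresholds are ours): it takes $P$ uniform on a subset of columns, so that $\|P\|_{2}$ is bounded, and counts the rows $a$ with $|\langle M_{a},P\rangle|\geq 2^{-r}$.

```python
import numpy as np

# Sketch of the L2-extractor condition for the parity matrix M(a,x) = (-1)^{<a,x>}
# on small n: take P uniform on a subset of 2^{n - l} concepts (so that
# ||P||_2 = 2^{l/2} * 2^{-n/2}) and count rows a with large bias.
rng = np.random.default_rng(5)
n, l, r = 10, 3, 3
N = 2 ** n
X = np.array([[(i >> j) & 1 for j in range(n)] for i in range(N)])

subset = rng.choice(N, size=2 ** (n - l), replace=False)
P = np.zeros(N)
P[subset] = 1.0 / len(subset)                     # flat distribution, small l2 norm

bad_rows = 0
for i in range(N):                                # enumerate all rows a
    row = (-1.0) ** ((X @ X[i]) % 2)              # M_a as a +-1 vector over x
    if abs(row @ P) >= 2.0 ** (-r):
        bad_rows += 1
print("fraction of rows with bias >= 2^-r:", bad_rows / N)
```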

3.2 Anti-Concentration Bound for Quadratic Form on Unit Vectors

Lemma 3.1.

There exists an absolute constant $c$ such that the following holds. Let $\sigma$ be a Hermitian operator over the Hilbert space $\mathcal{V}=\mathbb{C}^{d}$, and let $v$ be a uniformly random unit vector in $\mathcal{V}$. Then for every $\varepsilon>0$, we have

\Pr\left[|v^{\dagger}\sigma v|\leq\frac{\varepsilon\|\sigma\|_{2}}{d}\right]\leq c\sqrt{\varepsilon}+e^{-d}.
Proof.

Let $g=(g_{1},\ldots,g_{d})\sim\mathcal{N}(0,1)^{d}$ be standard Gaussians. Notice that $\|g\|_{2}^{2}$ follows the $\chi_{d}^{2}$-distribution, and $|g^{\dagger}\sigma g|/\|g\|_{2}^{2}$ is equidistributed with $|v^{\dagger}\sigma v|$. Therefore by a union bound we have

\Pr_{v}\left[|v^{\dagger}\sigma v|\leq\frac{\varepsilon\|\sigma\|_{2}}{d}\right]=\Pr_{g}\left[|g^{\dagger}\sigma g|\leq{\varepsilon\|\sigma\|_{2}}\cdot\frac{\|g\|^{2}_{2}}{d}\right]\leq\Pr_{g}\left[|g^{\dagger}\sigma g|\leq 5\varepsilon\|\sigma\|_{2}\right]+\Pr_{g}\Big{[}\|g\|_{2}^{2}\geq 5d\Big{]}.

For the first term, notice that $\mathrm{Var}[g^{\dagger}\sigma g]=2\mathrm{Tr}[\sigma^{2}]$ (see e.g. [RS08, Chapter 5]), which is no smaller than $2\|\sigma\|_{2}^{2}$. Therefore, by the Carbery–Wright inequality [CW01], there exists an absolute constant $c$ such that

\Pr\left[|g^{\dagger}\sigma g|\leq 5\varepsilon\|\sigma\|_{2}\right]\leq\Pr\left[|g^{\dagger}\sigma g|\leq 4\varepsilon\mathrm{Var}[g^{\dagger}\sigma g]^{1/2}\right]\leq c\sqrt{\varepsilon}.

For the second term, the standard Laurent-Massart bound on $\chi^{2}$-distributions [LM00] gives:

\Pr\Big{[}\|g\|_{2}^{2}\geq 5d\Big{]}\leq e^{-d}.\qed
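A hedged Monte Carlo check of Lemma 3.1 (illustrative parameters; not a proof) estimates the probability on the left-hand side by sampling random unit vectors:

```python
import numpy as np

# Estimate Pr[|v' sigma v| <= eps * ||sigma||_2 / d] over random unit vectors v.
rng = np.random.default_rng(6)
d, eps, trials = 32, 0.01, 20000

A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
sigma = (A + A.conj().T) / 2                       # a Hermitian operator
spec = np.max(np.abs(np.linalg.eigvalsh(sigma)))   # ||sigma||_2

g = rng.normal(size=(trials, d)) + 1j * rng.normal(size=(trials, d))
v = g / np.linalg.norm(g, axis=1, keepdims=True)   # random unit vectors
quad = np.abs(np.einsum('ti,ij,tj->t', v.conj(), sigma, v))

print(np.mean(quad <= eps * spec / d), "vs c*sqrt(eps) + e^-d ~", np.sqrt(eps))
```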

3.3 Multipartite Quantum Systems

The state of $q$ qubits can be represented in a Hilbert space $\mathcal{V}=(\mathbb{C}^{2})^{\otimes q}=\mathbb{C}^{2^{q}}$. In a product of $m$ Hilbert spaces $\mathcal{V}_{[m]}=\mathcal{V}_{1}\otimes\cdots\otimes\mathcal{V}_{m}$, a multipartite partial system $V_{1},\ldots,V_{m}$ is represented by a partial density operator $\rho_{V_{[m]}}$. For a subset $I\subseteq[m]$ of indices, the subsystem on $\{V_{i}\}_{i\in I}$ (or $V_{I}$ for short) is defined by tracing out $j\notin I$, that is,

\rho_{V_{I}}=\mathrm{Tr}_{V_{j\notin I}}[\rho_{V_{[m]}}].

Now for any two disjoint subsets $I,J\subset[m]$, given some $|v_{J}\rangle\in\mathcal{V}_{J}=\bigotimes_{j\in J}\mathcal{V}_{j}$, the conditional system on $V_{I}$ is defined as

\rho_{V_{I}|v_{J}}=\left(\mathbb{I}_{V_{I}}\otimes\langle v_{J}|\right)\rho_{V_{I\cup J}}\left(\mathbb{I}_{V_{I}}\otimes|v_{J}\rangle\right),

which is a partial density operator on $V_{I}$. Note that the trace

\mathrm{Tr}\left[\rho_{V_{I}|v_{J}}\right]=\langle v_{J}|\rho_{V_{J}}|v_{J}\rangle

only depends on the system $\rho$ and $|v_{J}\rangle$, while being independent of the choice of $I$.

Another simple fact that will be used repeatedly later on is that for an orthogonal basis $\mathcal{B}$ of $V_{J}$, we have

\rho_{V_{I}}=\mathrm{Tr}_{V_{J}}[\rho_{V_{I\cup J}}]=\sum_{|v_{J}\rangle\in\mathcal{B}}\rho_{V_{I}|v_{J}}.
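A small numpy sketch (ours, for illustration) of these two operations on a bipartite system: forming the conditional operator $\rho_{V_{I}|v_{J}}$ by sandwiching with $\mathbb{I}_{V_{I}}\otimes\langle v_{J}|$, and recovering the partial trace by summing over an orthonormal basis of $V_{J}$.

```python
import numpy as np

# Conditional operator rho_{V_I | v_J} and the fact that summing it over an
# orthonormal basis of V_J recovers the partial trace over V_J.
rng = np.random.default_rng(7)
dI, dJ = 3, 4

A = rng.normal(size=(dI * dJ, dI * dJ)) + 1j * rng.normal(size=(dI * dJ, dI * dJ))
rho = A @ A.conj().T
rho /= np.trace(rho)                                # a density operator on V_I (x) V_J

def conditional(rho, vJ):
    # (I_{V_I} (x) <v_J|) rho (I_{V_I} (x) |v_J>)
    proj = np.kron(np.eye(dI), vJ.conj().reshape(1, dJ))
    return proj @ rho @ proj.conj().T

basis = np.eye(dJ)
rho_I = sum(conditional(rho, basis[j]) for j in range(dJ))   # partial trace over V_J

rho_I_direct = np.trace(rho.reshape(dI, dJ, dI, dJ), axis1=1, axis2=3)
print(np.allclose(rho_I, rho_I_direct))             # True
```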

3.4 Classical-Quantum Systems

In the underlying space $\mathcal{V}_{1}\otimes\cdots\otimes\mathcal{V}_{m}$ of the multipartite system, we say $\mathcal{V}_{i}$ is classical if there is a fixed orthogonal basis $\mathcal{B}_{i}$ of $\mathcal{V}_{i}$, such that for every multipartite system $\rho_{V_{[m]}}$, every pair of distinct $|v_{i}\rangle\neq|v_{i}^{\prime}\rangle\in\mathcal{B}_{i}$ and every two states $|v\rangle,|v^{\prime}\rangle\in\bigotimes_{j\neq i}\mathcal{V}_{j}$, we have

\langle v_{i},v|\rho_{V_{[m]}}|v_{i}^{\prime},v^{\prime}\rangle=0.

Without loss of generality, in the rest of the work we always assume $\mathcal{B}_{i}$ is the set of computational basis states. We also identify $\mathcal{V}_{i}$ with the discrete set $\mathcal{B}_{i}$, and remove the Dirac brackets when we talk about the classical elements in $\mathcal{V}_{i}$. In this case every multipartite system $\rho_{V_{[m]}}$ can be written as a direct sum

\rho_{V_{[m]}}=\bigoplus_{v_{i}\in\mathcal{V}_{i}}\rho_{V_{[m]\setminus\{i\}}|v_{i}}.

The reader may find this direct sum viewpoint easier to handle in some later scenarios.

When $V_{I}$ is classical, conditioned on any $|v_{J}\rangle\in\mathcal{V}_{J}$ with $J$ disjoint from $I$, the system $\rho_{V_{I}|v_{J}}$ is represented as a diagonal matrix on $V_{I}$. If $\mathrm{Tr}[\rho_{V_{I}|v_{J}}]>0$, it induces a distribution over the computational basis states of $V_{I}$, defined as

P^{\rho}_{V_{I}|v_{J}}=\operatorname{diag}\rho_{V_{I}|v_{J}}/\mathrm{Tr}[\rho_{V_{I}|v_{J}}]. (3)

In the rest of this paper, whenever we use the notation $P^{\rho}_{V_{I}|v_{J}}$, it is always implicitly assumed that $\mathrm{Tr}[\rho_{V_{I}|v_{J}}]>0$ and the distribution exists.

In this work we typically consider the following scenario: there is a quantum memory register $V$ ranging over the Hilbert space $\mathcal{V}$, and a classical memory register $W$ ranging over the set of memory states $\mathcal{W}$, along with some classical information $X\in\mathcal{X}$ (later in the work, it is the concept to be learned) that is correlated with $V$ and $W$. We will make use of the following fact:

Claim 3.2.

Let $\rho_{XVW}$ be a classical-quantum system over classical $X,W$ and quantum $V$. For every $w\in\mathcal{W}$, $P^{\rho}_{X|w}$ is a convex combination of $P^{\rho}_{X|v,w}$ for some $\{|v\rangle\}\subseteq\mathcal{V}$.

Proof.

Let $\mathcal{B}$ be an orthogonal basis of $\mathcal{V}$, so that we have (from the end of the last section)

\rho_{X|w}=\sum_{|v\rangle\in\mathcal{B}}\rho_{X|v,w}.

Therefore $P^{\rho}_{X|w}$ is a linear combination of $P^{\rho}_{X|v,w}$ for $|v\rangle\in\mathcal{B}$, with non-negative coefficients. Since they are all distributions, it must be a convex combination. ∎

Characterization of operators over classical-quantum hybrid systems.

Now we identify all possible operators on the classical-quantum hybrid memory space $\mathcal{V}\otimes\mathcal{W}$. Before imposing the assumption that $W$ is classical, we think of a quantum channel operating on the system as working on the underlying space $\mathcal{V}\otimes\mathbb{C}^{|\mathcal{W}|}$. We denote by $\mathscr{T}_{\mathcal{V}\otimes\mathcal{W}}$ the set of all such quantum channels $\Phi$ that satisfy the following: for every classical-quantum system $\rho_{VW}$ in $\mathcal{V}\otimes\mathcal{W}$, $W$ is still classical in $\Phi(\rho_{VW})$. That is, for every two states $|v\rangle,|v^{\prime}\rangle\in\mathcal{V}$ and every pair of distinct $w,w^{\prime}\in\mathcal{W}$, we have

\langle v,w|\Phi(\rho_{VW})|v^{\prime},w^{\prime}\rangle=0.

Note that not all channels in $\mathscr{T}_{\mathcal{V}\otimes\mathcal{W}}$ are physically realizable. For instance, with one-bit classical memory and no quantum memory, the channel

\begin{pmatrix}a&c\\ \overline{c}&b\end{pmatrix}\mapsto\begin{pmatrix}a&ic\\ -i\overline{c}&b\end{pmatrix}

is not a classical operator. However, since we are restricted to classical-quantum systems, this channel is effectively equivalent to the identity channel on one-bit classical memory. Generally speaking, every channel in $\mathscr{T}_{\mathcal{V}\otimes\mathcal{W}}$ is equivalent to a channel controlled by $\mathcal{W}$ that maps $\mathcal{V}$ to $\mathcal{V}\otimes\mathcal{W}$. Below, we prove this observation and use it to show the following claim:

Claim 3.3.

Let $\rho_{XVW}$ be a classical-quantum system over classical $X,W$ and quantum $V$. Let $\Phi\in\mathscr{T}_{\mathcal{V}\otimes\mathcal{W}}$, and we use $\Phi(\rho)$ to denote the system after applying $\Phi$ to $VW$ and identity to $X$. Then for every $|v\rangle\in\mathcal{V}$ and $w\in\mathcal{W}$, $P^{\Phi(\rho)}_{X|v,w}$ is a convex combination of $P^{\rho}_{X|v^{\prime},w^{\prime}}$ for some $\{|v^{\prime}\rangle\}\subseteq\mathcal{V}$ and $\{w^{\prime}\}\subseteq\mathcal{W}$.

One difference between Claim 3.2 and Claim 3.3 is that in Claim 3.3 it is not always possible to write $P^{\Phi(\rho)}_{X|v,w}$ as a convex combination of $P^{\rho}_{X|v^{\prime},w^{\prime}}$ for $|v^{\prime}\rangle$ from an orthogonal basis of $\mathcal{V}$, whereas this is always possible in Claim 3.2. Although the difference does not matter in this work, we mention it here for clarity.

Proof.

Since $\Phi\in\mathscr{T}_{\mathcal{V}\otimes\mathcal{W}}$, the following channel is functionally equivalent to $\Phi$ for classical-quantum systems:

\Phi^{\prime}:\rho\to\sum_{w\in\mathcal{W}}\Phi(\rho_{V|w}\otimes|w\rangle\langle w|).

The physical meaning of $\Phi^{\prime}$ is to measure $W$ under the computational basis (which should not change the functionality we care about) and then apply $\Phi$.

By defining the channel $\Phi_{w}(\cdot):=\Phi(\cdot\otimes|w\rangle\langle w|)$, the above can alternatively be written as:

\Phi^{\prime}:\rho\to\sum_{w\in\mathcal{W}}\Phi_{w}(\rho_{V|w}).

Now consider the Kraus representation of each $\Phi_{w}$, that is, a finite set of linear operators $E_{w,k}:\mathcal{V}\rightarrow\mathcal{V}\otimes\mathcal{W}$ such that

\Phi_{w}(\rho_{V|w})=\sum_{k}E_{w,k}\rho_{V|w}E_{w,k}^{\dagger},\qquad\sum_{k}E_{w,k}^{\dagger}E_{w,k}=\mathbb{I}_{V}.

We can write

\Phi(\rho)_{X|v,w}=\Phi^{\prime}(\rho)_{X|v,w}=(\mathbb{I}_{X}\otimes\langle v,w|)\Phi^{\prime}(\rho)(\mathbb{I}_{X}\otimes|v,w\rangle)
=\sum_{w^{\prime}\in\mathcal{W}}\sum_{k}(\mathbb{I}_{X}\otimes\langle v,w|E_{w^{\prime},k})\rho_{XV|w^{\prime}}(\mathbb{I}_{X}\otimes E_{w^{\prime},k}^{\dagger}|v,w\rangle)
=\sum_{w^{\prime}\in\mathcal{W}}\sum_{k}\big{\|}E_{w^{\prime},k}^{\dagger}|v,w\rangle\big{\|}_{2}^{2}\cdot\rho_{X|v^{\prime},w^{\prime}}

where in each term of the summation, $|v^{\prime}\rangle\sim E_{w^{\prime},k}^{\dagger}|v,w\rangle$. Similar to the argument in Claim 3.2, $P^{\Phi(\rho)}_{X|v,w}$ is a convex combination of $P^{\rho}_{X|v^{\prime},w^{\prime}}$. ∎
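The following sketch (an illustrative construction of ours, not the paper's) mirrors the proof: it measures the classical register $W$ in the computational basis and then applies a Kraus channel $E_{w,k}:\mathcal{V}\to\mathcal{V}\otimes\mathcal{W}$ controlled by the outcome $w$, which keeps $W$ classical and preserves the trace.

```python
import numpy as np

# Channel Phi' above: measure W in the computational basis, then apply a Kraus
# channel E_{w,k}: V -> V (x) W controlled by the outcome w.
rng = np.random.default_rng(8)
dV, dW = 2, 2

def random_kraus(d_in, d_out, num):
    # Build Kraus operators with sum_k E_k^dagger E_k = I via a random isometry.
    A = rng.normal(size=(num * d_out, d_in)) + 1j * rng.normal(size=(num * d_out, d_in))
    Q, _ = np.linalg.qr(A)                        # isometry of shape (num*d_out, d_in)
    return [Q[k * d_out:(k + 1) * d_out] for k in range(num)]

kraus = {w: random_kraus(dV, dV * dW, 2) for w in range(dW)}   # E_{w,k}

def apply_channel(rho_VW):
    out = np.zeros((dV * dW, dV * dW), dtype=complex)
    for w in range(dW):
        # rho_{V|w}: block of rho_VW with W conditioned on |w><w|
        idx = [v * dW + w for v in range(dV)]
        rho_Vw = rho_VW[np.ix_(idx, idx)]
        for E in kraus[w]:
            out += E @ rho_Vw @ E.conj().T
    return out

rho = np.kron(np.eye(dV) / dV, np.diag([0.7, 0.3]))   # a classical-quantum input
print(np.trace(apply_channel(rho)))                    # trace is preserved (~1)
```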

3.5 Branching Program with Hybrid Memory

For a learning problem that corresponds to the matrix $M$, a branching program with hybrid memory, with $m$-bit classical memory, $q$-qubit quantum memory and length $T$, is specified as follows.

At each stage $0\leq t\leq T$, the memory state of the branching program is described by a classical-quantum system $\rho^{(t)}_{VW}$ over the quantum memory space $\mathcal{V}=(\mathbb{C}^{2})^{\otimes q}$ and classical memory space $\mathcal{W}=\{0,1\}^{m}$. The memory state evolves based on the samples that the branching program receives, and therefore depends on the unknown element $x\in_{R}\mathcal{X}$. We can then interpret the overall system over $XVW$, in which $X$ consists of the unknown concept $x$, as a classical-quantum system $\rho^{(t)}_{XVW}$. It always holds that the distribution of $x$ is uniform, i.e.,

\rho^{(t)}_{X}=\mathrm{Tr}_{VW}[\rho^{(t)}_{XVW}]=\frac{1}{2^{n}}\mathbb{I}_{X}.

Initially, the memory $VW$ is independent of $X$ and can be arbitrarily initialized. We assume that

\rho^{(0)}_{XVW}=\frac{1}{2^{n}}\mathbb{I}_{X}\otimes\frac{1}{2^{q}}\mathbb{I}_{V}\otimes\frac{1}{2^{m}}\mathbb{I}_{W}.

At each stage $0\leq t<T$, the branching program receives a sample $(a,b)$, where $a\in_{R}\mathcal{A}$ and $b=M(a,x)$, and applies an operation $\Phi_{t,a,b}\in\mathscr{T}_{\mathcal{V}\otimes\mathcal{W}}$ to its memory state. Thus the evolution of the entire system can be written as

\rho^{(t+1)}_{XVW}=\mathop{\mathbf{E}}\limits_{a\in_{R}\mathcal{A}}\left[\sum_{x\in\mathcal{X}}|x\rangle\langle x|\otimes\Phi_{t,a,M(a,x)}\big{(}\rho^{(t)}_{VW|x}\big{)}\right].

Finally, at stage $t=T$, a measurement over the computational basis is applied to $\rho^{(T)}_{VW}$, and the branching program outputs an element $\widetilde{x}\in X$ as a function of the measurement result $(v,w)\in\{0,1\}^{q+m}$. The success probability of the program is the probability that $\widetilde{x}=x$, which can be formulated as

\sum_{\begin{subarray}{c}x\in\mathcal{X},v\in\{0,1\}^{q},w\in\mathcal{W}\\ \widetilde{x}(v,w)=x\end{subarray}}\langle x,v,w|\rho^{(T)}_{XVW}|x,v,w\rangle.
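To make the model concrete, here is a hedged toy simulation of one evolution step $\rho^{(t)}\mapsto\rho^{(t+1)}$ for parity learning with tiny parameters; the channel $\Phi_{t,a,b}$ used below is an arbitrary placeholder of ours (a rotation on the qubit plus a classical update), chosen only to show the bookkeeping.

```python
import numpy as np

# Toy simulation (illustrative, not from the paper) of one step of a hybrid
# branching program: n = 2 concept bits, q = 1 qubit, m = 1 classical bit.
n, dV, dW = 2, 2, 2
N, dM = 2 ** n, 2 * 2   # number of concepts, dimension of V (x) W

def phi(a, b, rho_vw):  # a placeholder channel in T_{V (x) W}
    out = np.zeros_like(rho_vw)
    theta = 0.3 * (1 + 2 * a[0] + b)
    U = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
    for w in range(dW):
        idx = [v * dW + w for v in range(dV)]
        rho_Vw = rho_vw[np.ix_(idx, idx)]
        new_w = w ^ ((b + 1) // 2)          # store b (mapped from {-1,1} to {0,1})
        new_idx = [v * dW + new_w for v in range(dV)]
        out[np.ix_(new_idx, new_idx)] += U @ rho_Vw @ U.T
    return out

# rho^{(t)}_{XVW}: for each x, a partial density operator on V (x) W, trace 2^{-n}.
rho = {x: np.eye(dM) / (dM * N) for x in range(N)}

new_rho = {x: np.zeros((dM, dM)) for x in range(N)}
for a_int in range(N):                      # expectation over uniformly random a
    a = [(a_int >> j) & 1 for j in range(n)]
    for x in range(N):
        b = (-1) ** (sum(ai * ((x >> j) & 1) for j, ai in enumerate(a)) % 2)
        new_rho[x] += phi(a, b, rho[x]) / N
print(sum(np.trace(new_rho[x]).real for x in range(N)))   # total trace stays 1
```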

4 Main Result

Theorem 2.

Let $\mathcal{X},\mathcal{A}$ be two finite sets with $n=\log_{2}|\mathcal{X}|$. Let $M:\mathcal{A}\times\mathcal{X}\rightarrow\{-1,1\}$ be a matrix which is a $(k^{\prime},\ell^{\prime})$-$L_{2}$ extractor with error $2^{-r^{\prime}}$ for sufficiently large $k^{\prime},\ell^{\prime}$ and $r^{\prime}$, where $\ell^{\prime}\leq n$. Let

r=\min\left\{\frac{1}{4}r^{\prime},\frac{1}{26}\ell^{\prime}+\frac{1}{6},\frac{1}{2}(k^{\prime}-1)\right\}.

Let $\rho$ be a branching program for the learning problem corresponding to $M$, described by classical-quantum systems $\rho^{(t)}_{XVW}$, with $q$-qubit quantum memory $V$, $m$-bit classical memory $W$ and length $T$. If $m\leq\frac{1}{44}(k^{\prime}-1)\ell^{\prime}$, $q\leq r-7$ and $T\leq 2^{r-2}$, then the success probability of $\rho$ is at most $O(2^{q-r})$.

From now on we let $k=k^{\prime}-1$ and $\ell=\frac{1}{5}(\ell^{\prime}-13r-2)$. Then we have the following inequalities, to be used later:

q+r+1-r^{\prime}\ \leq\ -2r. (4)
2\ell+9r-n\ \leq\ -r. (5)
(k-r)\ell\ \geq\ 2m+4r+1. (6)

We defer the detailed calculation of the above inequalities to Appendix A.

4.1 Truncated Classical-Quantum Systems

Here we describe how to truncate a partial classical-quantum system $\rho_{XVW}$ according to some desired property $G(v,w)$ of $\rho_{X|v,w}$. The goal is to remove the parts of $\rho_{XVW}$ where $G$ is not satisfied. We execute the following procedure:

  1. Maintain a partial system $\rho^{\prime}_{XVW}$ initialized as $\rho_{XVW}$, and subspaces $\mathcal{V}_{w}\subseteq\mathcal{V}$ initialized as $\mathcal{V}$ for each $w\in\mathcal{W}$.

  2. Pick $w\in\mathcal{W}$ and $|v\rangle\in\mathcal{V}_{w}$ such that $\mathrm{Tr}[\rho^{\prime}_{X|v,w}]>0$ and $G(v,w)$ is false.

  3. Change the partial system $\rho^{\prime}_{XVW}$ into the following system by projection:

     \big{(}\mathbb{I}_{X}\otimes(\mathbb{I}_{VW}-|v,w\rangle\langle v,w|)\big{)}\rho^{\prime}_{XVW}\big{(}\mathbb{I}_{X}\otimes(\mathbb{I}_{VW}-|v,w\rangle\langle v,w|)\big{)},

     and change $\mathcal{V}_{w}$ to its subspace orthogonal to $|v\rangle$, that is,

     \{|v^{\prime}\rangle\in\mathcal{V}_{w}\mid\langle v|v^{\prime}\rangle=0\}.
  4. Repeat from step 2 until there is no such $w$ and $|v\rangle$. Denote the final system as $\rho^{|G}_{XVW}$.

In step 2 we pick $w$ and $|v\rangle$ arbitrarily as long as they satisfy the requirements; however, we can always think of it as iterating over $w\in\mathcal{W}$ and processing each $\rho_{XV|w}$ separately. The choices of $|v\rangle$ for each $w$ do affect the final system $\rho^{|G}_{XVW}$; yet as we will see later, these choices are irrelevant to our proof.
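A hedged sketch of this procedure for a single classical memory state $w$ (the property $G$ below is a placeholder $\ell_{2}$-norm threshold of ours, and the search over $\mathcal{V}_{w}$ is simplified to an eigenbasis of the remaining subspace):

```python
import numpy as np

# Truncation sketch for one w: repeatedly project out directions |v> where the
# placeholder property G (an l2-norm threshold on P_{X|v,w}) fails.
rng = np.random.default_rng(9)
dX, dV = 16, 4

A = rng.normal(size=(dX * dV, dX * dV))
rho = A @ A.T; rho /= np.trace(rho)              # rho_{XV|w}; X is the first factor

def conditional(rho, v):
    # (I_X (x) <v|) rho (I_X (x) |v>), a partial operator on X
    proj = np.kron(np.eye(dX), v.reshape(1, dV))
    return proj @ rho @ proj.T

def violating_direction(rho, Q):
    # Search an eigenbasis of the remaining subspace (projector Q) for a bad |v>.
    for v in np.linalg.eigh(Q)[1].T[::-1][:int(round(np.trace(Q)))]:
        cond = conditional(rho, v)
        t = np.trace(cond)
        if t > 1e-12 and np.linalg.norm(np.diag(cond) / t) > 1.2 / np.sqrt(dX):
            return v
    return None

Q = np.eye(dV)                                   # projector onto V_w, initially all of V
while (v := violating_direction(rho, Q)) is not None:
    Pi = np.kron(np.eye(dX), np.eye(dV) - np.outer(v, v))
    rho = Pi @ rho @ Pi                          # remove the direction |v, w>
    Q = Q - np.outer(v, v)                       # shrink the remaining subspace
print("remaining trace after truncation:", np.trace(rho))
```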

Below, we give two useful lemmas on truncated systems.

Lemma 4.1.

For every $|v\rangle\in\mathcal{V}$ and $w\in\mathcal{W}$ such that $\mathrm{Tr}[\rho^{|G}_{X|v,w}]>0$, there exists $|v^{\prime}\rangle$ in the remaining subspace $\mathcal{V}_{w}$ such that

P^{\rho^{|G}}_{X|v,w}=P^{\rho}_{X|v^{\prime},w}=P^{\rho^{|G}}_{X|v^{\prime},w}.
Proof.

It suffices to prove the lemma for one round of the truncation procedure. Suppose $|v_{1},w_{1}\rangle$ is picked in step 2, resulting in the partial system

\rho^{\prime}_{XVW}=\big{(}\mathbb{I}_{X}\otimes(\mathbb{I}_{VW}-|v_{1},w_{1}\rangle\langle v_{1},w_{1}|)\big{)}\rho_{XVW}\big{(}\mathbb{I}_{X}\otimes(\mathbb{I}_{VW}-|v_{1},w_{1}\rangle\langle v_{1},w_{1}|)\big{)}.

We can write

\rho^{\prime}_{X|v,w}=\big{(}\mathbb{I}_{X}\otimes\langle v,w|\big{)}\rho^{\prime}_{XVW}\big{(}\mathbb{I}_{X}\otimes|v,w\rangle\big{)}
=\big{(}\mathbb{I}_{X}\otimes(\langle v,w|-\langle v,w|v_{1},w_{1}\rangle\langle v_{1},w_{1}|)\big{)}\rho_{XVW}\big{(}\mathbb{I}_{X}\otimes(|v,w\rangle-|v_{1},w_{1}\rangle\langle v_{1},w_{1}|v,w\rangle)\big{)}.
  • If $w\neq w_{1}$, then

    \rho^{\prime}_{X|v,w}=\big{(}\mathbb{I}_{X}\otimes\langle v,w|\big{)}\rho_{XVW}\big{(}\mathbb{I}_{X}\otimes|v,w\rangle\big{)}=\rho_{X|v,w}.

    And the lemma holds directly by choosing $|v^{\prime}\rangle=|v\rangle$.

  • If $w=w_{1}$, then with $\langle v_{1},w_{1}|v,w\rangle=\langle v_{1}|v\rangle=\lambda$, we have

    \rho^{\prime}_{X|v,w}=\big{(}\mathbb{I}_{X}\otimes(\langle v|-\overline{\lambda}\langle v_{1}|)\langle w|\big{)}\rho_{XVW}\big{(}\mathbb{I}_{X}\otimes(|v\rangle-\lambda|v_{1}\rangle)|w\rangle\big{)}.

    By the fact that $\mathrm{Tr}[\rho^{|G}_{X|v,w}]>0$, we must have $|v\rangle\neq|v_{1}\rangle$. Therefore if we let $|v^{\prime}\rangle\sim|v\rangle-\lambda|v_{1}\rangle$, which is the normalized projection of $|v\rangle$ onto the orthogonal subspace of $|v_{1}\rangle$, the above equality implies that $P^{\rho^{\prime}}_{X|v,w}=P^{\rho}_{X|v^{\prime},w}$. Meanwhile, since $\langle v_{1}|v^{\prime}\rangle=0$ we have $\rho^{\prime}_{X|v^{\prime},w}=\rho_{X|v^{\prime},w}$, which completes the proof. ∎

A direct corollary of the above lemma is that if $G(v,w)$ only depends on the distribution $P^{\rho}_{X|v,w}$, then $G(v,w)$ holds for every $|v\rangle\in\mathcal{V}$ and $w\in\mathcal{W}$ in the truncated system $\rho^{|G}_{XVW}$, even when $|v\rangle$ is not in the remaining subspace $\mathcal{V}_{w}$.

Our second lemma is based on the following fact, which bounds the trace distance between a partial system and its projection; the proof can be found in Appendix B.

Proposition 4.2.

For every partial system $\rho$ and projection operator $\Pi$ on $\rho$, we have

\big{\|}\rho-\Pi\rho\Pi\big{\|}^{2}_{\mathrm{Tr}}\leq 4\mathrm{Tr}[\rho]^{2}-4\mathrm{Tr}[\Pi\rho]^{2}.
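A hedged numerical check of Proposition 4.2 on random partial density operators and random projections (a sanity check, not a proof):

```python
import numpy as np

# For random partial density operators rho and random projections Pi, verify
# ||rho - Pi rho Pi||_Tr^2 <= 4 Tr[rho]^2 - 4 Tr[Pi rho]^2.
rng = np.random.default_rng(10)
d = 8
for _ in range(1000):
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = A @ A.conj().T
    rho *= rng.uniform(0.1, 1.0) / np.trace(rho).real   # partial: trace <= 1
    Q, _ = np.linalg.qr(rng.normal(size=(d, rng.integers(1, d))))
    Pi = Q @ Q.conj().T                                  # a random projection
    lhs = np.sum(np.linalg.svd(rho - Pi @ rho @ Pi, compute_uv=False)) ** 2
    rhs = 4 * np.trace(rho).real ** 2 - 4 * np.trace(Pi @ rho).real ** 2
    assert lhs <= rhs + 1e-9
print("Proposition 4.2 holds on all sampled instances")
```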
Lemma 4.3.

For each $w\in\mathcal{W}$, let $|v_{1}\rangle,\ldots,|v_{d}\rangle$ be the states picked in step 2 within $\mathcal{V}_{w}$. Then

\big{\|}\rho_{XV|w}-\rho^{|G}_{XV|w}\big{\|}_{\mathrm{Tr}}\leq 3\sum_{i=1}^{d}\sqrt{\mathrm{Tr}[\rho_{X|v_{i},w}]\mathrm{Tr}[\rho_{XV|w}]}.
Proof.

In Proposition 4.2, take $\rho$ to be $\rho_{XV|w}$, and $\Pi$ to be

\mathbb{I}_{X}\otimes\prod_{i=1}^{d}\left(\mathbb{I}_{V}-|v_{i}\rangle\langle v_{i}|\right)=\mathbb{I}_{X}\otimes\left(\mathbb{I}_{V}-\sum_{i=1}^{d}|v_{i}\rangle\langle v_{i}|\right).

Then $\Pi\rho\Pi=\rho^{|G}_{XV|w}$ and $\mathrm{Tr}[\Pi\rho]=\mathrm{Tr}[\rho_{XV|w}]-\sum_{i=1}^{d}\mathrm{Tr}[\rho_{X|v_{i},w}]$. Therefore we have

\big{\|}\rho_{XV|w}-\rho^{|G}_{XV|w}\big{\|}_{\mathrm{Tr}}\leq\sqrt{4\mathrm{Tr}[\rho]^{2}-4\mathrm{Tr}[\Pi\rho]^{2}}
\leq\sqrt{8(\mathrm{Tr}[\rho]-\mathrm{Tr}[\Pi\rho])\mathrm{Tr}[\rho]}
=\sqrt{8\sum_{i=1}^{d}\mathrm{Tr}[\rho_{X|v_{i},w}]\mathrm{Tr}[\rho_{XV|w}]}
\leq 3\sum_{i=1}^{d}\sqrt{\mathrm{Tr}[\rho_{X|v_{i},w}]\mathrm{Tr}[\rho_{XV|w}]}.\qed

Since $\mathrm{Tr}[\rho_{XV|w}]\leq 1$ always holds, by summing over all $w\in\mathcal{W}$ we get the following corollary:

Corollary 4.4.

Let $|v_{1},w_{1}\rangle,\ldots,|v_{d},w_{d}\rangle$ be all of the memory states picked in step 2. Then

\big{\|}\rho_{XVW}-\rho^{|G}_{XVW}\big{\|}_{\mathrm{Tr}}\leq 3\sum_{i=1}^{d}\sqrt{\mathrm{Tr}[\rho_{X|v_{i},w_{i}}]}.

4.2 Truncated Branching Program

The properties that we desire for the partial system $\rho_{XVW}$ consist of three parts:

  • Small $L_{2}$ norm: Let $G_{2}(v,w)$ be the property that

    \big{\|}P^{\rho}_{X|v,w}\big{\|}_{2}\leq 2^{\ell}\cdot 2^{-n/2}.

  • Small $L_{\infty}$ norm: Let $G_{\infty}(v,w)$ be the property that

    \big{\|}P^{\rho}_{X|v,w}\big{\|}_{\infty}\leq 2^{2\ell+9r}\cdot 2^{-n}.

  • Even division: For every $a\in\mathcal{A}$, let $G_{a}(v,w)$ be the property that

    |\langle M_{a},P^{\rho}_{X|v,w}\rangle|\leq 2^{-r}.

Now we define the truncated branching program, by specifying the truncated partial classical-quantum system $\tau^{(t)}_{XVW}$ for each stage $t$. Initially let $\tau^{(0)}_{XVW}=\rho^{(0)}_{XVW}$. For each stage $0\leq t\leq T$, the truncation consists of three ingredients (below we ignore the superscripts on $P$ for convenience):

  1. Remove parts where $\big{\|}P_{X|v,w}\big{\|}_{2}$ is large. That is, let $\tau^{(t,\star)}_{XVW}=\tau^{(t)|G_{2}}_{XVW}$.

  2. 2.

    Remove parts where PX|v,w\big{\|}P_{X|v,w}\big{\|}_{\infty} is large. This is done by two steps.

    • -

      First, let g{0,1}𝒳𝒲g\in\{0,1\}^{\mathcal{X}\otimes\mathcal{W}} be an indicator vector such that g(x,w)=1g(x,w)=1 if and only if

      Tr[τX|w(t,)]>0 and PX|wτ(t,)(x)22+5r2n.\mathrm{Tr}[\tau^{(t,\star)}_{X|w}]>0\textrm{ and }P^{\tau^{(t,\star)}}_{X|w}(x)\leq 2^{2\ell+5r}\cdot 2^{-n}.

      Let τXVW(t,)=(gg𝕀V)τXVW(t,)(gg𝕀V)\tau^{(t,\circ)}_{XVW}=(gg^{\dagger}\otimes\mathbb{I}_{V})\tau^{(t,\star)}_{XVW}(gg^{\dagger}\otimes\mathbb{I}_{V}), where gggg^{\dagger} is the projection operator acting on 𝒳𝒲\mathcal{X}\otimes\mathcal{W}.

    •

      To make sure that the distributions did not change a lot after the projection gggg^{\dagger}, for each 0t<T0\leq t<T, let Gt(v,w)G_{t}(v,w) be the property that

      Tr[τX|v,w(t,)](12r)Tr[τX|v,w(t,)].\mathrm{Tr}[\tau^{(t,\circ)}_{X|v,w}]\geq(1-2^{-r})\mathrm{Tr}[\tau^{(t,\star)}_{X|v,w}].

      Let τXVW(t,)=τXVW(t,)|GGt\tau^{(t,\infty)}_{XVW}=\tau^{(t,\circ)|G_{\infty}\wedge G_{t}}_{XVW}.

  3.

    For each a𝒜a\in\mathcal{A}, remove (only for this aa) parts where PX|v,wP_{X|v,w} is not evenly divided by aa. That is, for each a𝒜a\in\mathcal{A}, let τXVW(t,a)=τXVW(t,)|Ga\tau^{(t,a)}_{XVW}=\tau^{(t,\infty)|G_{a}}_{XVW}.

Then, if t<Tt<T, for each aR𝒜a\in_{R}\mathcal{A} we evolve the system by applying the sample operations Φt,a,b\Phi_{t,a,b} as the original branching program on τXVW(t,a)\tau^{(t,a)}_{XVW}, so that we have

τXVW(t+1)=𝐄aR𝒜[x𝒳|xx|Φt,a,M(a,x)(τVW|x(t,a))].\tau^{(t+1)}_{XVW}=\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\left[\sum_{x\in\mathcal{X}}|x\rangle\langle x|\otimes\Phi_{t,a,M(a,x)}\big{(}\tau^{(t,a)}_{VW|x}\big{)}\right].

4.3 Bounding the Truncation Difference

In order to show that the success probability of the original branching program ρ(t)\rho^{(t)} is low, the plan is to prove an upper bound on the success probability of the truncated branching program τ(t)\tau^{(t)}, and bound the difference between the two probabilities.

Here we bound the difference by the trace distance between the two systems ρXVW(t)\rho^{(t)}_{XVW} and τXVW(t)\tau^{(t)}_{XVW}. We will show that the contribution to the trace distance from each one of the truncation ingredients is small, and in addition the evolution preserves the trace distance.

4.3.1 Truncation by G2G_{2}

Lemma 4.5.

For every 0tT0\leq t\leq T, |v𝒱|v\rangle\in\mathcal{V} and w𝒲w\in\mathcal{W} such that G2(v,w)G_{2}(v,w) is violated (that is, PX|v,wτ(t)2>22n/2\big{\|}P^{\tau^{(t)}}_{X|v,w}\big{\|}_{2}>2^{\ell}\cdot 2^{-n/2}), we must have Tr[τX|v,w(t)]<22m24r\mathrm{Tr}[\tau^{(t)}_{X|v,w}]<2^{-2m}\cdot 2^{-4r}.

The lemma says that any direction |v,w|v,w\rangle picked by the truncation procedure has small weight, so the truncation does not change the state significantly.

Proof.

This is our main technical lemma and we defer the proof to Section 5. ∎

Since there are at most 2q+m2^{q+m} such directions picked in the truncation procedure, we conclude the following corollary.

Corollary 4.6.

For every 0tT0\leq t\leq T, we have τXVW(t,)τXVW(t)Tr32q2r\big{\|}\tau^{(t,\star)}_{XVW}-\tau^{(t)}_{XVW}\big{\|}_{\mathrm{Tr}}\leq 3\cdot 2^{q-2r}.

Proof.

Recall that τXVW(t,)=τXVW(t)|G2\tau^{(t,\star)}_{XVW}=\tau^{(t)|G_{2}}_{XVW}. Since dim(𝒱𝒲)=2q+m\dim(\mathcal{V}\otimes\mathcal{W})=2^{q+m}, the truncation lasts for at most 2q+m2^{q+m} rounds. Since in every round the picked |v,w|v,w\rangle has Tr[τX|v,w(t)]<22m24r\mathrm{Tr}[\tau^{(t)}_{X|v,w}]<2^{-2m}\cdot 2^{-4r}, by Corollary 4.4 we have

τXVW(t,)τXVW(t)Tr32q+m22m24r=32q2r.\big{\|}\tau^{(t,\star)}_{XVW}-\tau^{(t)}_{XVW}\big{\|}_{\mathrm{Tr}}\leq 3\cdot 2^{q+m}\cdot\sqrt{2^{-2m}\cdot 2^{-4r}}=3\cdot 2^{q-2r}.\qed

4.3.2 Truncation by GG_{\infty}

Lemma 4.7.

For every 0tT0\leq t\leq T and w𝒲w\in\mathcal{W} we have

x𝒳g(x,w)=0PX|wτ(t,)(x)25r.\sum_{\begin{subarray}{c}x\in\mathcal{X}\\ g(x,w)=0\end{subarray}}P^{\tau^{(t,\star)}}_{X|w}(x)\leq 2^{-5r}.
Proof.

By 3.2, PX|wτ(t,)P^{\tau^{(t,\star)}}_{X|w} is a convex combination of PX|v,wτ(t,)P^{\tau^{(t,\star)}}_{X|v,w}. From Lemma 4.1 we know that G2(PX|v,wτ(t,))G_{2}(P^{\tau^{(t,\star)}}_{X|v,w}) holds for every |v|v\rangle and ww, and thus by convexity of 2\ell_{2}-norms we know that G2(PX|wτ(t,))G_{2}(P^{\tau^{(t,\star)}}_{X|w}) also holds. That means

𝐄xPX|wτ(t,)[PX|wτ(t,)(x)]=PX|wτ(t,)22222n.\mathop{\mathbf{E}\mbox{}}\limits_{x\sim P^{\tau^{(t,\star)}}_{X|w}}\left[P^{\tau^{(t,\star)}}_{X|w}(x)\right]=\big{\|}P^{\tau^{(t,\star)}}_{X|w}\big{\|}^{2}_{2}\leq 2^{2\ell}\cdot 2^{-n}.

Therefore, by Markov’s inequality we have

x𝒳g(x,w)=0PX|wτ(t,)(x)=PrxPX|wτ(t,)[PX|wτ(t,)(x)>22+5r2n]25r.\sum_{\begin{subarray}{c}x\in\mathcal{X}\\ g(x,w)=0\end{subarray}}P^{\tau^{(t,\star)}}_{X|w}(x)=\Pr_{x\sim P^{\tau^{(t,\star)}}_{X|w}}\left[P^{\tau^{(t,\star)}}_{X|w}(x)>2^{2\ell+5r}\cdot 2^{-n}\right]\leq 2^{-5r}.\qed
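Spelled out, the Markov step is applied to the non-negative random variable P^{τ(t,⋆)}_{X|w}(x) with x drawn from P^{τ(t,⋆)}_{X|w}:

\Pr_{x\sim P^{\tau^{(t,\star)}}_{X|w}}\left[P^{\tau^{(t,\star)}}_{X|w}(x)>2^{2\ell+5r}\cdot 2^{-n}\right]\leq\frac{\mathbf{E}_{x\sim P^{\tau^{(t,\star)}}_{X|w}}\big[P^{\tau^{(t,\star)}}_{X|w}(x)\big]}{2^{2\ell+5r}\cdot 2^{-n}}\leq\frac{2^{2\ell}\cdot 2^{-n}}{2^{2\ell+5r}\cdot 2^{-n}}=2^{-5r}.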
Corollary 4.8.

For every 0tT0\leq t\leq T and every w𝒲w\in\mathcal{W}, we have τXV|w(t,)τXV|w(t,)\tau^{(t,\circ)}_{XV|w}\leq\tau^{(t,\star)}_{XV|w}, and

Tr[τXV|w(t,)](125r)Tr[τXV|w(t,)].\mathrm{Tr}[\tau^{(t,\circ)}_{XV|w}]\geq(1-2^{-5r})\cdot\mathrm{Tr}[\tau^{(t,\star)}_{XV|w}].

Moreover, we have τXVW(t,)τXVW(t,)Tr25r\big{\|}\tau^{(t,\circ)}_{XVW}-\tau^{(t,\star)}_{XVW}\big{\|}_{\mathrm{Tr}}\leq 2^{-5r}.

Proof.

Since XX and WW are both classical and τXVW(t,)=(gg𝕀V)τXVW(t,)(gg𝕀V)\tau^{(t,\circ)}_{XVW}=(gg^{\dagger}\otimes\mathbb{I}_{V})\tau^{(t,\star)}_{XVW}(gg^{\dagger}\otimes\mathbb{I}_{V}), we have

τXV|w(t,)τXV|w(t,)=x𝒳g(x,w)=0|xx|τV|x,w(t,),\tau^{(t,\star)}_{XV|w}-\tau^{(t,\circ)}_{XV|w}=\sum_{\begin{subarray}{c}x\in\mathcal{X}\\ g(x,w)=0\end{subarray}}|x\rangle\langle x|\otimes\tau^{(t,\star)}_{V|x,w},

which is positive semi-definite. Recalling that (Equation 3)

Tr[τV|x,w(t,)]=x,w|τXW(t,)|x,w=diagτX|w(t,)(x)=PX|wτ(t,)(x)Tr[τX|w(t,)],\mathrm{Tr}[\tau^{(t,\star)}_{V|x,w}]=\langle x,w|\tau^{(t,\star)}_{XW}|x,w\rangle=\operatorname{diag}\tau^{(t,\star)}_{X|w}(x)=P^{\tau^{(t,\star)}}_{X|w}(x)\mathrm{Tr}[\tau^{(t,\star)}_{X|w}],

we have

Tr[τXV|w(t,)]Tr[τXV|w(t,)]=x𝒳g(x,w)=0Tr[τV|x,w(t,)]=x𝒳g(x,w)=0PX|wτ(t,)(x)Tr[τX|w(t,)]25rTr[τX|w(t,)].\mathrm{Tr}[\tau^{(t,\star)}_{XV|w}]-\mathrm{Tr}[\tau^{(t,\circ)}_{XV|w}]=\sum_{\begin{subarray}{c}x\in\mathcal{X}\\ g(x,w)=0\end{subarray}}\mathrm{Tr}[\tau^{(t,\star)}_{V|x,w}]=\sum_{\begin{subarray}{c}x\in\mathcal{X}\\ g(x,w)=0\end{subarray}}P^{\tau^{(t,\star)}}_{X|w}(x)\cdot\mathrm{Tr}[\tau^{(t,\star)}_{X|w}]\leq 2^{-5r}\cdot\mathrm{Tr}[\tau^{(t,\star)}_{X|w}].

Therefore, as τXVW(t,)τXVW(t,)\tau^{(t,\star)}_{XVW}-\tau^{(t,\circ)}_{XVW} is positive semi-definite, we have

τXVW(t,)τXVW(t,)Tr=w𝒲Tr[τXV|w(t,)]Tr[τXV|w(t,)]25rw𝒲Tr[τX|w(t,)]25r.\big{\|}\tau^{(t,\circ)}_{XVW}-\tau^{(t,\star)}_{XVW}\big{\|}_{\mathrm{Tr}}=\sum_{w\in\mathcal{W}}\mathrm{Tr}[\tau^{(t,\star)}_{XV|w}]-\mathrm{Tr}[\tau^{(t,\circ)}_{XV|w}]\leq 2^{-5r}\sum_{w\in\mathcal{W}}\mathrm{Tr}[\tau^{(t,\star)}_{X|w}]\leq 2^{-5r}.\qed
Lemma 4.9.

For every 0tT0\leq t\leq T, |v𝒱|v\rangle\in\mathcal{V} and w𝒲w\in\mathcal{W} such that G(v,w)G_{\infty}(v,w) is violated (that is, PX|v,wτ(t,)>22+9r2n\big{\|}P^{\tau^{(t,\circ)}}_{X|v,w}\big{\|}_{\infty}>2^{2\ell+9r}\cdot 2^{-n}) or Gt(v,w)G_{t}(v,w) is violated (that is, Tr[τX|v,w(t,)]<(12r)Tr[τX|v,w(t,)]\mathrm{Tr}[\tau^{(t,\circ)}_{X|v,w}]<(1-2^{-r})\mathrm{Tr}[\tau^{(t,\star)}_{X|v,w}]), we must have Tr[τX|v,w(t,)]<224rTr[τX|w(t,)]\mathrm{Tr}[\tau^{(t,\circ)}_{X|v,w}]<2\cdot 2^{-4r}\cdot\mathrm{Tr}[\tau^{(t,\circ)}_{X|w}].

Proof.

If G(v,w)G_{\infty}(v,w) is violated, let x𝒳x\in\mathcal{X} be the one such that PX|v,wτ(t,)(x)>22+9r2nP^{\tau^{(t,\circ)}}_{X|v,w}(x)>2^{2\ell+9r}\cdot 2^{-n}. If g(x,w)=0g(x,w)=0 then PX|wτ(t,)(x)=0P^{\tau^{(t,\circ)}}_{X|w}(x)=0, while if g(x,w)=1g(x,w)=1 then by Corollary 4.8,

PX|wτ(t,)(x)Tr[τX|w(t,)]Tr[τX|w(t,)]22+5r2n(125r)122+5r2n.P^{\tau^{(t,\circ)}}_{X|w}(x)\leq\frac{\mathrm{Tr}[\tau^{(t,\star)}_{X|w}]}{\mathrm{Tr}[\tau^{(t,\circ)}_{X|w}]}\cdot 2^{2\ell+5r}\cdot 2^{-n}\leq(1-2^{-5r})^{-1}\cdot 2^{2\ell+5r}\cdot 2^{-n}.

Hence we always have

Tr[τX|v,w(t,)]PX|wτ(t,)(x)PX|v,wτ(t,)(x)Tr[τX|w(t,)]224rTr[τX|w(t,)],\mathrm{Tr}[\tau^{(t,\circ)}_{X|v,w}]\leq\frac{P^{\tau^{(t,\circ)}}_{X|w}(x)}{P^{\tau^{(t,\circ)}}_{X|v,w}(x)}\cdot\mathrm{Tr}[\tau^{(t,\circ)}_{X|w}]\leq 2\cdot 2^{-4r}\cdot\mathrm{Tr}[\tau^{(t,\circ)}_{X|w}],

where the first inequality comes from the fact that τX|w(t,)τX|v,w(t,)\tau^{(t,\circ)}_{X|w}\geq\tau^{(t,\circ)}_{X|v,w} and Equation 3.

If Gt(v,w)G_{t}(v,w) is violated, since we know from Corollary 4.8 that

|Tr[τX|v,w(t,)]Tr[τX|v,w(t,)]|\displaystyle\left|\mathrm{Tr}[\tau^{(t,\circ)}_{X|v,w}]-\mathrm{Tr}[\tau^{(t,\star)}_{X|v,w}]\right| τXV|w(t,)τXV|w(t,)Tr25rTr[τXV|w(t,)]\displaystyle\leq\big{\|}\tau^{(t,\circ)}_{XV|w}-\tau^{(t,\star)}_{XV|w}\big{\|}_{\mathrm{Tr}}\leq 2^{-5r}\cdot\mathrm{Tr}[\tau^{(t,\star)}_{XV|w}]
25r(125r)1Tr[τXV|w(t,)],\displaystyle\leq 2^{-5r}\cdot(1-2^{-5r})^{-1}\cdot\mathrm{Tr}[\tau^{(t,\circ)}_{XV|w}],

therefore from Tr[τX|v,w(t,)]<(12r)Tr[τX|v,w(t,)]\mathrm{Tr}[\tau^{(t,\circ)}_{X|v,w}]<(1-2^{-r})\mathrm{Tr}[\tau^{(t,\star)}_{X|v,w}] we deduce that

Tr[τX|v,w(t,)]\displaystyle\mathrm{Tr}[\tau^{(t,\circ)}_{X|v,w}] <(2r1)(Tr[τX|v,w(t,)]Tr[τX|v,w(t,)])\displaystyle<(2^{r}-1)\cdot\left(\mathrm{Tr}[\tau^{(t,\star)}_{X|v,w}]-\mathrm{Tr}[\tau^{(t,\circ)}_{X|v,w}]\right)
(2r1)25r(125r)1Tr[τXV|w(t,)]\displaystyle\leq(2^{r}-1)\cdot 2^{-5r}\cdot(1-2^{-5r})^{-1}\cdot\mathrm{Tr}[\tau^{(t,\circ)}_{XV|w}]
<224rTr[τX|w(t,)].\displaystyle<2\cdot 2^{-4r}\cdot\mathrm{Tr}[\tau^{(t,\circ)}_{X|w}].\qed
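The final step above can be verified directly, assuming r ≥ 1 and using Tr[τ^(t,∘)_{XV|w}] = Tr[τ^(t,∘)_{X|w}]:

(2^{r}-1)\cdot 2^{-5r}\cdot(1-2^{-5r})^{-1}<2^{-4r}\cdot(1-2^{-5})^{-1}<2\cdot 2^{-4r}.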
Corollary 4.10.

For every 0tT0\leq t\leq T, we have τXVW(t,)τXVW(t,)Tr52q2r\big{\|}\tau^{(t,\infty)}_{XVW}-\tau^{(t,\circ)}_{XVW}\big{\|}_{\mathrm{Tr}}\leq 5\cdot 2^{q-2r}.

Proof.

Recall that τXVW(t,)=τXVW(t,)|GGt\tau^{(t,\infty)}_{XVW}=\tau^{(t,\circ)|G_{\infty}\wedge G_{t}}_{XVW}. For each w𝒲w\in\mathcal{W}, the truncation picks at most dim𝒱=2q\dim\mathcal{V}=2^{q} states |v,w|v,w\rangle, each with Tr[τX|v,w(t,)]<224rTr[τX|w(t,)]\mathrm{Tr}[\tau^{(t,\circ)}_{X|v,w}]<2\cdot 2^{-4r}\cdot\mathrm{Tr}[\tau^{(t,\circ)}_{X|w}]. Therefore by applying Lemma 4.3 for each w𝒲w\in\mathcal{W}, we have

τXVW(t,)τXVW(t,)Tr3w𝒲2q224rTr[τX|w(t,)]52q2r.\big{\|}\tau^{(t,\infty)}_{XVW}-\tau^{(t,\circ)}_{XVW}\big{\|}_{\mathrm{Tr}}\leq 3\cdot\sum_{w\in\mathcal{W}}2^{q}\cdot\sqrt{2\cdot 2^{-4r}}\cdot\mathrm{Tr}[\tau^{(t,\circ)}_{X|w}]\leq 5\cdot 2^{q-2r}.\qed

4.3.3 Truncation by GaG_{a}

Notice that in the truncation step from τ(t,)\tau^{(t,\star)} to τ(t,)\tau^{(t,\circ)}, the distribution PX|v,wτ(t,)P^{\tau^{(t,\star)}}_{X|v,w} might change and not satisfy G2G_{2} anymore. However, with the truncation by GtG_{t}, any such distribution that changes too much is eliminated, and we have the following guarantee.

Lemma 4.11.

For every 0tT0\leq t\leq T, |v𝒱|v\rangle\in\mathcal{V} and w𝒲w\in\mathcal{W}, we have

PX|v,wτ(t,)2(12r)122n/2.\big{\|}P^{\tau^{(t,\infty)}}_{X|v,w}\big{\|}_{2}\leq(1-2^{-r})^{-1}\cdot 2^{\ell}\cdot 2^{-n/2}.
Proof.

By Lemma 4.1, there exists |v𝒱|v^{\prime}\rangle\in\mathcal{V} such that PX|v,wτ(t,)=PX|v,wτ(t,)=PX|v,wτ(t,)P^{\tau^{(t,\infty)}}_{X|v,w}=P^{\tau^{(t,\infty)}}_{X|v^{\prime},w}=P^{\tau^{(t,\circ)}}_{X|v^{\prime},w}. The truncation by GtG_{t} ensures that Tr[τX|v,w(t,)](12r)Tr[τX|v,w(t,)]\mathrm{Tr}[\tau^{(t,\circ)}_{X|v^{\prime},w}]\geq(1-2^{-r})\mathrm{Tr}[\tau^{(t,\star)}_{X|v^{\prime},w}], and therefore

PX|v,wτ(t,)2=PX|v,wτ(t,)2=diagτX|v,w(t,)2Tr[τX|v,w(t,)]diagτX|v,w(t,)2(12r)Tr[τX|v,w(t,)](12r)122n/2.\big{\|}P^{\tau^{(t,\infty)}}_{X|v,w}\big{\|}_{2}=\big{\|}P^{\tau^{(t,\circ)}}_{X|v^{\prime},w}\big{\|}_{2}=\frac{\big{\|}\operatorname{diag}\tau^{(t,\circ)}_{X|v^{\prime},w}\big{\|}_{2}}{\mathrm{Tr}[\tau^{(t,\circ)}_{X|v^{\prime},w}]}\leq\frac{\big{\|}\operatorname{diag}\tau^{(t,\star)}_{X|v^{\prime},w}\big{\|}_{2}}{(1-2^{-r})\mathrm{Tr}[\tau^{(t,\star)}_{X|v^{\prime},w}]}\leq(1-2^{-r})^{-1}\cdot 2^{\ell}\cdot 2^{-n/2}.\qed
Lemma 4.12.

For every partial classical-quantum system τXV\tau_{XV} over 𝒳𝒱\mathcal{X}\otimes\mathcal{V} such that PX|vτ222n/2\big{\|}P^{\tau}_{X|v}\big{\|}_{2}\leq 2^{\ell^{\prime}}\cdot 2^{-n/2} holds for every |v𝒱|v\rangle\in\mathcal{V}, we have

PraR𝒜[|v𝒱,|Ma,PX|vτ|2r]22r.\Pr_{a\in_{R}\mathcal{A}}\left[\exists|v\rangle\in\mathcal{V},|\langle M_{a},P^{\tau}_{X|v}\rangle|\geq 2^{-r}\right]\leq 2^{-2r}.
Proof.

Notice that we can think of τV=TrX[τXV]\tau_{V}=\mathrm{Tr}_{X}[\tau_{XV}] as 𝕀V\mathbb{I}_{V}. This is because we can first assume that τV\tau_{V} is full rank (otherwise restrict 𝒱\mathcal{V} to the support of τV\tau_{V}, and the conclusion in this lemma still holds), and if we have the diagonalization QτVQ=𝕀VQ^{\dagger}\tau_{V}Q=\mathbb{I}_{V} for some non-singular QQ, then consider the new system

τXV=(𝕀XQ)τXV(𝕀XQ),\tau^{\prime}_{XV}=(\mathbb{I}_{X}\otimes Q^{\dagger})\tau_{XV}(\mathbb{I}_{X}\otimes Q),

and the set of distributions {PX|vτ}\{P^{\tau}_{X|v}\} and {PX|vτ}\{P^{\tau^{\prime}}_{X|v}\} over |v𝒱|v\rangle\in\mathcal{V} are the same, since PX|vτ=PX|vτP^{\tau^{\prime}}_{X|v}=P^{\tau}_{X|v^{\prime}} for |vQ|v|v^{\prime}\rangle\sim Q|v\rangle. With τV=𝕀V\tau_{V}=\mathbb{I}_{V} we have Tr[τX|v]=1\mathrm{Tr}[\tau_{X|v}]=1 for every |v𝒱|v\rangle\in\mathcal{V}, and thus PX|vτ=diagτX|vP^{\tau}_{X|v}=\operatorname{diag}\tau_{X|v}.

Let 𝒜𝒜\mathcal{A}^{\prime}\subseteq\mathcal{A} be the set of a𝒜a\in\mathcal{A} such that there exists |v𝒱|v\rangle\in\mathcal{V} with |Ma,PX|vτ|2r|\langle M_{a},P^{\tau}_{X|v}\rangle|\geq 2^{-r}. For each a𝒜a\in\mathcal{A}^{\prime}, let

σa=TrX[(DiagMa𝕀V)τXV]\sigma_{a}=\mathrm{Tr}_{X}[(\operatorname{Diag}M_{a}\otimes\mathbb{I}_{V})\tau_{XV}]

which is a Hermitian operator on 𝒱\mathcal{V}. There exists |v𝒱|v\rangle\in\mathcal{V} such that

|v|σa|v|=|Ma,diagτX|v|=|Ma,PX|vτ|2r,|\langle v|\sigma_{a}|v\rangle|=|\langle M_{a},\operatorname{diag}\tau_{X|v}\rangle|=|\langle M_{a},P^{\tau}_{X|v}\rangle|\geq 2^{-r},

which means that σa22r\|\sigma_{a}\|_{2}\geq 2^{-r}. Now let |u|u\rangle be a uniformly random unit vector in 𝒱\mathcal{V}, and by Lemma 3.1 we know that for some absolute constant cc,

Pr|u[|u|σa|u|2r]12(q+rr)/2ce2q12rce11/2.\Pr_{|u\rangle}\left[|\langle u|\sigma_{a}|u\rangle|\geq 2^{-r^{\prime}}\right]\geq 1-2^{(q+r-r^{\prime})/2}c-e^{-2^{q}}\geq 1-2^{-r}c-e^{-1}\geq 1/2.

The second last inequality comes from Eq. 4, while the last inequality is because of the assumption that rr is sufficiently large.

Since the above holds for every a𝒜a\in\mathcal{A}^{\prime}, it implies that Pra𝒜,|u[|u|σa|u|2r]\Pr_{a\in\mathcal{A}^{\prime},|u\rangle}[|\langle u|\sigma_{a}|u\rangle|\geq 2^{-r^{\prime}}] is at least 1/21/2. It means that there exists some |u𝒱|u\rangle\in\mathcal{V} such that |u|σa|u|2r|\langle u|\sigma_{a}|u\rangle|\geq 2^{-r^{\prime}} for at least half of a𝒜a\in\mathcal{A}^{\prime}. On the other hand, since MM is a (k,)(k^{\prime},\ell^{\prime})-extractor with error 2r2^{-r^{\prime}}, and PX|uτ222n/2\big{\|}P^{\tau}_{X|u}\big{\|}_{2}\leq 2^{\ell^{\prime}}\cdot 2^{-n/2}, there are at most 2k2^{-k^{\prime}} fraction of a𝒜a\in\mathcal{A} such that |u|σa|u|=|Ma,PX|uτ|2r|\langle u|\sigma_{a}|u\rangle|=|\langle M_{a},P^{\tau}_{X|u}\rangle|\geq 2^{-r^{\prime}}. That means

PraR𝒜[a𝒜]22k22r.\Pr_{a\in_{R}\mathcal{A}}\left[a\in\mathcal{A}^{\prime}\right]\leq 2\cdot 2^{-k^{\prime}}\leq 2^{-2r}.

Here k12rk^{\prime}-1\geq 2r, by the definition of rr. ∎

Corollary 4.13.

For every 0tT0\leq t\leq T, we have 𝐄aR𝒜τXVW(t,a)τXVW(t,)Tr22r\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\big{\|}\tau^{(t,a)}_{XVW}-\tau^{(t,\infty)}_{XVW}\big{\|}_{\mathrm{Tr}}\leq 2^{-2r}.

Proof.

For each w𝒲w\in\mathcal{W}, the partial system τXV|w(t,)\tau^{(t,\infty)}_{XV|w} satisfies the condition of Lemma 4.12 since for every |v𝒱|v\rangle\in\mathcal{V},

PX|v,wτ(t,)2(12r)122n/222n/2.\big{\|}P^{\tau^{(t,\infty)}}_{X|v,w}\big{\|}_{2}\leq(1-2^{-r})^{-1}\cdot 2^{\ell}\cdot 2^{-n/2}\leq 2^{\ell^{\prime}}\cdot 2^{-n/2}.

Notice that for each a𝒜a\in\mathcal{A} such that there does not exist |v𝒱|v\rangle\in\mathcal{V} with |Ma,PX|v,wτ(t,)|2r|\langle M_{a},P^{\tau^{(t,\infty)}}_{X|v,w}\rangle|\geq 2^{-r} (that is, when Ga(v,w)G_{a}(v,w) holds for every |v𝒱|v\rangle\in\mathcal{V}), the subsystem τXV|w(t,)\tau^{(t,\infty)}_{XV|w} is not touched in the truncation by GaG_{a} and we have τXV|w(t,a)=τXV|w(t,)\tau^{(t,a)}_{XV|w}=\tau^{(t,\infty)}_{XV|w}. Therefore

𝐄aR𝒜τXVW(t,a)τXVW(t,)Tr\displaystyle\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\big{\|}\tau^{(t,a)}_{XVW}-\tau^{(t,\infty)}_{XVW}\big{\|}_{\mathrm{Tr}} =w𝒲𝐄aR𝒜τXV|w(t,a)τXV|w(t,)Tr\displaystyle=\sum_{w\in\mathcal{W}}\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\big{\|}\tau^{(t,a)}_{XV|w}-\tau^{(t,\infty)}_{XV|w}\big{\|}_{\mathrm{Tr}}
w𝒲PraR𝒜[|v𝒱,|Ma,PX|v,wτ(t,)|2r]Tr[τXV|w(t,)]\displaystyle\leq\sum_{w\in\mathcal{W}}\Pr_{a\in_{R}\mathcal{A}}\left[\exists|v\rangle\in\mathcal{V},|\langle M_{a},P^{\tau^{(t,\infty)}}_{X|v,w}\rangle|\geq 2^{-r}\right]\cdot\mathrm{Tr}[\tau^{(t,\infty)}_{XV|w}]
22rw𝒲Tr[τXV|w(t,)]22r.\displaystyle\leq 2^{-2r}\cdot\sum_{w\in\mathcal{W}}\mathrm{Tr}[\tau^{(t,\infty)}_{XV|w}]\leq 2^{-2r}.\qed

4.3.4 Evolution preserves trace distance

Lemma 4.14.

For every 0t<T0\leq t<T, we have τXVW(t+1)ρXVW(t+1)Tr𝐄aR𝒜τXVW(t,a)ρXVW(t)Tr\big{\|}\tau^{(t+1)}_{XVW}-\rho^{(t+1)}_{XVW}\big{\|}_{\mathrm{Tr}}\leq\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\big{\|}\tau^{(t,a)}_{XVW}-\rho^{(t)}_{XVW}\big{\|}_{\mathrm{Tr}}.

Proof.

Recall that

ρXVW(t+1)=𝐄aR𝒜[x𝒳|xx|Φt,a,M(a,x)(ρVW|x(t))],\rho^{(t+1)}_{XVW}=\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\left[\sum_{x\in\mathcal{X}}|x\rangle\langle x|\otimes\Phi_{t,a,M(a,x)}\big{(}\rho^{(t)}_{VW|x}\big{)}\right],
τXVW(t+1)=𝐄aR𝒜[x𝒳|xx|Φt,a,M(a,x)(τVW|x(t,a))].\tau^{(t+1)}_{XVW}=\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\left[\sum_{x\in\mathcal{X}}|x\rangle\langle x|\otimes\Phi_{t,a,M(a,x)}\big{(}\tau^{(t,a)}_{VW|x}\big{)}\right].

Therefore by triangle inequality and contractivity of quantum channels under trace norms,

τXVW(t+1)ρXVW(t+1)Tr\displaystyle\big{\|}\tau^{(t+1)}_{XVW}-\rho^{(t+1)}_{XVW}\big{\|}_{\mathrm{Tr}} 𝐄aR𝒜x𝒳|xx|(Φt,a,M(a,x)(τVW|x(t,a))Φt,a,M(a,x)(ρVW|x(t)))Tr\displaystyle\leq\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\left\|\sum_{x\in\mathcal{X}}|x\rangle\langle x|\otimes\left(\Phi_{t,a,M(a,x)}\big{(}\tau^{(t,a)}_{VW|x}\big{)}-\Phi_{t,a,M(a,x)}\big{(}\rho^{(t)}_{VW|x}\big{)}\right)\right\|_{\mathrm{Tr}}
=𝐄aR𝒜x𝒳Φt,a,M(a,x)(τVW|x(t,a))Φt,a,M(a,x)(ρVW|x(t))Tr\displaystyle=\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\sum_{x\in\mathcal{X}}\left\|\Phi_{t,a,M(a,x)}\big{(}\tau^{(t,a)}_{VW|x}\big{)}-\Phi_{t,a,M(a,x)}\big{(}\rho^{(t)}_{VW|x}\big{)}\right\|_{\mathrm{Tr}}
𝐄aR𝒜x𝒳τVW|x(t,a)ρVW|x(t)Tr\displaystyle\leq\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\sum_{x\in\mathcal{X}}\big{\|}\tau^{(t,a)}_{VW|x}-\rho^{(t)}_{VW|x}\big{\|}_{\mathrm{Tr}}
=𝐄aR𝒜τXVW(t,a)ρXVW(t)Tr.\displaystyle=\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\big{\|}\tau^{(t,a)}_{XVW}-\rho^{(t)}_{XVW}\big{\|}_{\mathrm{Tr}}.\qed

4.4 Proof of Theorem 2

Proof.

First, combining Corollaries 4.6, 4.8, 4.10, 4.13 and 4.14 we have

τXVW(t+1)ρXVW(t+1)TrτXVW(t)ρXVW(t)Tr+82q2r+25r+22r.\big{\|}\tau^{(t+1)}_{XVW}-\rho^{(t+1)}_{XVW}\big{\|}_{\mathrm{Tr}}\leq\big{\|}\tau^{(t)}_{XVW}-\rho^{(t)}_{XVW}\big{\|}_{\mathrm{Tr}}+8\cdot 2^{q-2r}+2^{-5r}+2^{-2r}.

Since τXVW(0)=ρXVW(0)\tau^{(0)}_{XVW}=\rho^{(0)}_{XVW}, by triangle inequality we know that τXVW(T)ρXVW(T)TrT102q2r102qr\big{\|}\tau^{(T)}_{XVW}-\rho^{(T)}_{XVW}\big{\|}_{\mathrm{Tr}}\leq T\cdot 10\cdot 2^{q-2r}\leq 10\cdot 2^{q-r}, and thus

τXVW(T,)ρXVW(T)Tr102qr+82q2r+25r.\big{\|}\tau^{(T,\infty)}_{XVW}-\rho^{(T)}_{XVW}\big{\|}_{\mathrm{Tr}}\leq 10\cdot 2^{q-r}+8\cdot 2^{q-2r}+2^{-5r}.

This bounds the difference between the measurement probabilities of τXVW(T,)\tau^{(T,\infty)}_{XVW} and ρXVW(T)\rho^{(T)}_{XVW} under any measurement, specifically the difference between the success probability of the branching program ρ\rho and the following value on τ\tau:

x𝒳,v{0,1}q,w𝒲x~(v,w)=xx,v,w|τXVW(T,)|x,v,w=v{0,1}q,w𝒲Tr[τX|v,w(T,)]PX|v,wτ(T,)(x~(v,w)).\sum_{\begin{subarray}{c}x\in\mathcal{X},v\in\{0,1\}^{q},w\in\mathcal{W}\\ \widetilde{x}(v,w)=x\end{subarray}}\langle x,v,w|\tau^{(T,\infty)}_{XVW}|x,v,w\rangle=\sum_{v\in\{0,1\}^{q},w\in\mathcal{W}}\mathrm{Tr}[\tau^{(T,\infty)}_{X|v,w}]\cdot P^{\tau^{(T,\infty)}}_{X|v,w}(\widetilde{x}(v,w)).

Since PX|v,wτ(T,)22+9r2n\big{\|}P^{\tau^{(T,\infty)}}_{X|v,w}\big{\|}_{\infty}\leq 2^{2\ell+9r}\cdot 2^{-n} and Tr[τXVW(T,)]1\mathrm{Tr}[\tau^{(T,\infty)}_{XVW}]\leq 1, the above value is at most 22+9r2n2^{2\ell+9r}\cdot 2^{-n}. Therefore the success probability of the branching program ρ\rho is at most (recall that 2+9rnr2\ell+9r-n\leq-r)

102qr+82q2r+25r+22+9r2n=O(2qr).10\cdot 2^{q-r}+8\cdot 2^{q-2r}+2^{-5r}+2^{2\ell+9r}\cdot 2^{-n}=O(2^{q-r}).\qed

5 Proof of Lemma 4.5

The first step towards proving Lemma 4.5 is to analyze how PX|v,wτ(t)P^{\tau^{(t)}}_{X|v,w} evolves according to the rule

τXVW(t+1)=𝐄aR𝒜[x𝒳|xx|Φt,a,M(a,x)(τVW|x(t,a))].\tau^{(t+1)}_{XVW}=\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\left[\sum_{x\in\mathcal{X}}|x\rangle\langle x|\otimes\Phi_{t,a,M(a,x)}\big{(}\tau^{(t,a)}_{VW|x}\big{)}\right].

We introduce the following notations. For every a𝒜a\in\mathcal{A} and b{1,1}b\in\{-1,1\}, let

𝟙a,b=12(1+bMa),\mathbbm{1}_{a,b}=\frac{1}{2}(\vec{1}+b\cdot M_{a}),

which is a 0-11 vector that indicates whether M(a,x)=bM(a,x)=b. Let

τXVW(t,a,b)=(Diag𝟙a,b𝕀VW)τXVW(t,a),\tau^{(t,a,b)}_{XVW}=(\operatorname{Diag}\mathbbm{1}_{a,b}\otimes\mathbb{I}_{VW})\tau^{(t,a)}_{XVW}, (7)

so that we can write

τXVW(t+1)=𝐄aR𝒜[(𝕀XΦt,a,1)(τXVW(t,a,1))+(𝕀XΦt,a,1)(τXVW(t,a,1))].\tau^{(t+1)}_{XVW}=\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\left[(\mathbb{I}_{X}\otimes\Phi_{t,a,1})\big{(}\tau^{(t,a,1)}_{XVW}\big{)}+(\mathbb{I}_{X}\otimes\Phi_{t,a,-1})\big{(}\tau^{(t,a,-1)}_{XVW}\big{)}\right]. (8)
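To see that this agrees with the earlier evolution rule for the truncated system, note the partition identity below: the two indicator vectors sum to the all-ones vector, and 𝟙_{a,b} selects exactly those x with M(a,x)=b, so each x is still processed by Φ_{t,a,M(a,x)} as before.

\mathbbm{1}_{a,1}+\mathbbm{1}_{a,-1}=\vec{1},\qquad\mathbbm{1}_{a,b}(x)=\frac{1}{2}\big{(}1+b\cdot M(a,x)\big{)}=\begin{cases}1&\textrm{if }M(a,x)=b,\\ 0&\textrm{otherwise.}\end{cases}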

Thus 3.3 implies that PX|v,wτ(t+1)P^{\tau^{(t+1)}}_{X|v,w} is a convex combination of PX|v,wτ(t,a,b)P^{\tau^{(t,a,b)}}_{X|v^{\prime},w^{\prime}} for some a,b,wa,b,w^{\prime} and |v|v^{\prime}\rangle.

5.1 Target Distribution and Badness

Before considering the target distribution, let us first establish that the 2\ell_{2}-norms of PX|v,wτ(t)P^{\tau^{(t)}}_{X|v,w} cannot be too large:

Lemma 5.1.

For every 0tT0\leq t\leq T, |v𝒱|v\rangle\in\mathcal{V}, w𝒲w\in\mathcal{W}, we have

PX|v,wτ(t)2422n/2.\big{\|}P^{\tau^{(t)}}_{X|v,w}\big{\|}_{2}\leq 4\cdot 2^{\ell}\cdot 2^{-n/2}.
Proof.

When t=0t=0 the statement is clearly true as PX|v,wτ(0)P^{\tau^{(0)}}_{X|v,w} is always uniform.

Now assume t>0t>0. By Lemma 4.1 and Lemma 4.11 we know that

PX|v,wτ(t1,a)2(12r)122n/2\big{\|}P^{\tau^{(t-1,a)}}_{X|v^{\prime},w^{\prime}}\big{\|}_{2}\leq(1-2^{-r})^{-1}\cdot 2^{\ell}\cdot 2^{-n/2}

for every w𝒲,|v𝒱w^{\prime}\in\mathcal{W},|v^{\prime}\rangle\in\mathcal{V} and a𝒜a\in\mathcal{A}, as τXVW(t1,a)\tau_{XVW}^{(t-1,a)} is truncated from τXVW(t1,)\tau_{XVW}^{(t-1,\infty)}. Since Ga(PX|v,wτ(t1,a))G_{a}(P^{\tau^{(t-1,a)}}_{X|v^{\prime},w^{\prime}}) is true, meaning that the distribution is evenly divided by aa, we further have

PX|v,wτ(t1,a,b)2=𝟙a,bPX|v,wτ(t1,a)2𝟙a,bPX|v,wτ(t1,a)12(12r)1PX|v,wτ(t1,a)2422n/2.\big{\|}P^{\tau^{(t-1,a,b)}}_{X|v^{\prime},w^{\prime}}\big{\|}_{2}=\frac{\big{\|}\mathbbm{1}_{a,b}\cdot P^{\tau^{(t-1,a)}}_{X|v^{\prime},w^{\prime}}\big{\|}_{2}}{\big{\|}\mathbbm{1}_{a,b}\cdot P^{\tau^{(t-1,a)}}_{X|v^{\prime},w^{\prime}}\big{\|}_{1}}\leq 2(1-2^{-r})^{-1}\cdot\big{\|}P^{\tau^{(t-1,a)}}_{X|v^{\prime},w^{\prime}}\big{\|}_{2}\leq 4\cdot 2^{\ell}\cdot 2^{-n/2}.

Since PX|v,wτ(t)P^{\tau^{(t)}}_{X|v,w} is a convex combination of PX|v,wτ(t1,a,b)P^{\tau^{(t-1,a,b)}}_{X|v^{\prime},w^{\prime}}, by convexity its 2\ell_{2}-norm is bounded by 422n/24\cdot 2^{\ell}\cdot 2^{-n/2}. ∎
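For completeness, the two inequalities in the last display of the proof above can be justified as follows, assuming r ≥ 2: the even-division property gives ‖𝟙_{a,b}·P^{τ(t−1,a)}_{X|v′,w′}‖_1 = ½(1 + b·⟨M_a, P^{τ(t−1,a)}_{X|v′,w′}⟩) ≥ ½(1 − 2^{−r}), every entry of 𝟙_{a,b}·P is at most the corresponding entry of P so the ℓ2-norm does not increase, and

2(1-2^{-r})^{-1}\cdot(1-2^{-r})^{-1}\cdot 2^{\ell}\cdot 2^{-n/2}=2(1-2^{-r})^{-2}\cdot 2^{\ell}\cdot 2^{-n/2}\leq 4\cdot 2^{\ell}\cdot 2^{-n/2},

since (1 − 2^{−r})² ≥ 1/2 whenever r ≥ 2.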

From now on we use PP to denote a fixed target distribution (which we will later choose to be the distribution in Lemma 4.5), such that

22n/2P2422n/2.2^{\ell}\cdot 2^{-n/2}\leq\|P\|_{2}\leq 4\cdot 2^{\ell}\cdot 2^{-n/2}.

We want to bound the progress of PX|v,wτ(t),P\langle P^{\tau^{(t)}}_{X|v,w},P\rangle, which starts off as 2n2^{-n} at t=0t=0, and becomes at least 222n2^{2\ell}\cdot 2^{-n} when PX|v,wτ(t)=PP^{\tau^{(t)}}_{X|v,w}=P. Note that by Cauchy-Schwarz we always have

PX|v,wτ(t),PPX|v,wτ(t)2P216222n.\langle P^{\tau^{(t)}}_{X|v,w},P\rangle\leq\big{\|}P^{\tau^{(t)}}_{X|v,w}\big{\|}_{2}\big{\|}P\big{\|}_{2}\leq 16\cdot 2^{2\ell}\cdot 2^{-n}. (9)

In order to bound the progress, we introduce some new notations. For any superscript (such as (t,a)(t,a)) on the partial systems, we use σXVW\sigma_{XVW} to denote τXVW(DiagP𝕀VW)\tau_{XVW}(\operatorname{Diag}P\otimes\mathbb{I}_{VW}). Notice that

Tr[σX|v,w]=Tr[τX|v,wDiagP]=Tr[τX|v,w]PX|v,wτ,P.\mathrm{Tr}[\sigma_{X|v,w}]=\mathrm{Tr}[\tau_{X|v,w}\operatorname{Diag}P]=\mathrm{Tr}[\tau_{X|v,w}]\cdot\langle P^{\tau}_{X|v,w},P\rangle.

Similarly, PX|v,wσP^{\sigma}_{X|v,w} can be deduced from PX|v,wτP^{\tau}_{X|v,w} via

PX|v,wσ(x)=Tr[τX|v,w]Tr[σX|v,w]PX|v,wτ(x)P(x)=PX|v,wτ(x)P(x)PX|v,wτ,P.P^{\sigma}_{X|v,w}(x)=\frac{\mathrm{Tr}[\tau_{X|v,w}]}{\mathrm{Tr}[\sigma_{X|v,w}]}\cdot P^{\tau}_{X|v,w}(x)\cdot P(x)=\frac{P^{\tau}_{X|v,w}(x)\cdot P(x)}{\langle P^{\tau}_{X|v,w},P\rangle}. (10)

Therefore we can bound the 2\ell_{2} norm of PX|v,wσP^{\sigma}_{X|v,w} as

PX|v,wσ21PX|v,wτ,PPX|v,wτP2.\big{\|}P^{\sigma}_{X|v,w}\big{\|}_{2}\leq\frac{1}{\langle P^{\tau}_{X|v,w},P\rangle}\cdot\big{\|}P^{\tau}_{X|v,w}\big{\|}_{\infty}\cdot\big{\|}P\big{\|}_{2}.

Now we can identify the places where PX|v,wτ(t),P\langle P^{\tau^{(t)}}_{X|v,w},P\rangle increases by a lot, which happens when the inner product is not evenly divided by some a𝒜a\in\mathcal{A} (we will see the reason in the analysis later). Formally, at stage 0t<T0\leq t<T, we say (w,a)(w,a) is bad if

|v𝒱, s.t. |Ma,PX|v,wσ(t,a)|>2r and PX|v,wτ(t,a),P122n.\exists|v\rangle\in\mathcal{V}\textrm{, s.t. }|\langle M_{a},P^{\sigma^{(t,a)}}_{X|v,w}\rangle|>2^{-r}\textrm{ and }\langle P^{\tau^{(t,a)}}_{X|v,w},P\rangle\geq\frac{1}{2}\cdot 2^{-n}. (11)
Lemma 5.2.

For every 0t<T0\leq t<T and w𝒲w\in\mathcal{W}, we have

PraR𝒜[(w,a) is bad]2k.\Pr_{a\in_{R}\mathcal{A}}[\textrm{$(w,a)$ is bad}]\leq 2^{-k}.
Proof.

Since τXVW(t,a)\tau^{(t,a)}_{XVW} is truncated from τXVW(t,)\tau^{(t,\infty)}_{XVW}, Lemma 4.1 shows that for every |v𝒱|v\rangle\in\mathcal{V}, w𝒲w\in\mathcal{W} and a𝒜a\in\mathcal{A} there is |v𝒱|v^{\prime}\rangle\in\mathcal{V} such that

PX|v,wτ(t,a)=PX|v,wτ(t,)P^{\tau^{(t,a)}}_{X|v,w}=P^{\tau^{(t,\infty)}}_{X|v^{\prime},w}

and by Eq. 10 it also implies that

PX|v,wσ(t,a)=PX|v,wσ(t,).P^{\sigma^{(t,a)}}_{X|v,w}=P^{\sigma^{(t,\infty)}}_{X|v^{\prime},w}.

Now fix some w𝒲w\in\mathcal{W}, and let 𝒜𝒜\mathcal{A}^{\prime}\subseteq\mathcal{A} be the set of a𝒜a\in\mathcal{A} such that

|v𝒱, s.t. |Ma,PX|v,wσ(t,)|>2r and PX|v,wτ(t,),P122n.\exists|v\rangle\in\mathcal{V}\textrm{, s.t. }|\langle M_{a},P^{\sigma^{(t,\infty)}}_{X|v,w}\rangle|>2^{-r}\textrm{ and }\langle P^{\tau^{(t,\infty)}}_{X|v,w},P\rangle\geq\frac{1}{2}\cdot 2^{-n}.

Then 𝒜\mathcal{A}^{\prime} contains all aa such that (w,a)(w,a) is bad, and our goal is to bound the fraction of 𝒜\mathcal{A}^{\prime} in 𝒜\mathcal{A}.

In the rest of the proof we temporarily omit the superscript and write τ(t,)\tau^{(t,\infty)} and σ(t,)\sigma^{(t,\infty)} simply as τ\tau and σ\sigma. For the same reason as in Lemma 4.12 we can assume that τV|w=𝕀V\tau_{V|w}=\mathbb{I}_{V}, and thus

v|σV|w|v=Tr[σX|v,w]=PX|v,wτ,P, and Tr[σXV|w]=PX|wτ,P16222n.\langle v|\sigma_{V|w}|v\rangle=\mathrm{Tr}[\sigma_{X|v,w}]=\langle P^{\tau}_{X|v,w},P\rangle,{\textrm{ and }\mathrm{Tr}[\sigma_{XV|w}]=\langle P^{\tau}_{X|w},P\rangle\leq 16\cdot 2^{2\ell}\cdot 2^{-n}.}

where the last inequality is by Lemma 4.11 and Cauchy-Schwarz, in the same way as Eq. 9.

Suppose that we have diagonalization σV|w=UDU\sigma_{V|w}=U^{\dagger}DU, where UU is unitary and DD is diagonal and non-negative. Let 𝒱𝒱\mathcal{V}^{\prime}\subseteq\mathcal{V} be the subspace spanned by U|eU^{\dagger}|e\rangle over the computational basis vectors |e𝒱|e\rangle\in\mathcal{V} such that e|D|e24r222n\langle e|D|e\rangle\geq 2^{-4r}\cdot{2^{-2\ell}}\cdot 2^{-n}. So for every |v𝒱|v\rangle\in\mathcal{V}^{\prime} we have

PX|v,wτ,P=Tr[σX|v,w]24r222n.\langle P^{\tau}_{X|v,w},P\rangle=\mathrm{Tr}[\sigma_{X|v,w}]\geq 2^{-4r}\cdot{2^{-2\ell}}\cdot 2^{-n}.

We claim that for every a𝒜a\in\mathcal{A}^{\prime}, there exists |v𝒱|v\rangle\in\mathcal{V}^{\prime} such that |Ma,PX|v,wσ|>122r|\langle M_{a},P^{\sigma}_{X|v,w}\rangle|>\frac{1}{2}\cdot 2^{-r}. To prove the claim, let Π\Pi be the projection operator from 𝒱\mathcal{V} to 𝒱\mathcal{V}^{\prime}, and then (𝕀XΠ)σXV|w(𝕀XΠ)(\mathbb{I}_{X}\otimes\Pi)\sigma_{XV|w}(\mathbb{I}_{X}\otimes\Pi) can be conceptually seen as a truncated partial system σXV|w|G\sigma^{|G}_{XV|w} where G(v,w)G(v,w) holds when Tr[σX|v,w]24r22n\mathrm{Tr}[\sigma_{X|v,w}]\geq 2^{-4r-2\ell}\cdot 2^{-n} for the fixed ww. By Lemma 4.3 we have

σXV|w|GσXV|wTr32q24r2nTr[σXV|w]122q2r2n.\big{\|}\sigma^{|G}_{XV|w}-\sigma_{XV|w}\big{\|}_{\mathrm{Tr}}\leq 3\cdot 2^{q}\cdot{\sqrt{2^{-4r-2\ell-n}\cdot\mathrm{Tr}[\sigma_{XV|w}]}}\leq 12\cdot 2^{q-2r}\cdot 2^{-n}.
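Here the last inequality plugs in the bound Tr[σ_{XV|w}] ≤ 16·2^{2ℓ}·2^{−n} from above:

3\cdot 2^{q}\cdot\sqrt{2^{-4r-2\ell-n}\cdot\mathrm{Tr}[\sigma_{XV|w}]}\leq 3\cdot 2^{q}\cdot\sqrt{2^{-4r-2\ell-n}\cdot 16\cdot 2^{2\ell}\cdot 2^{-n}}=12\cdot 2^{q-2r}\cdot 2^{-n}.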

Since a𝒜a\in\mathcal{A}^{\prime}, assume for |u𝒱|u\rangle\in\mathcal{V} we have |Ma,PX|u,wσ|>2r|\langle M_{a},P^{\sigma}_{X|u,w}\rangle|>2^{-r} and Tr[σX|u,w]=PX|u,wτ,P122n\mathrm{Tr}[\sigma_{X|u,w}]=\langle P^{\tau}_{X|u,w},P\rangle\geq\frac{1}{2}\cdot 2^{-n}. Let |vΠ|u|v\rangle\sim\Pi|u\rangle, then we have

PX|u,wσPX|v,wσ1=\displaystyle\big{\|}P^{\sigma}_{X|u,w}-P^{\sigma}_{X|v,w}\big{\|}_{1}= PX|u,wσPX|u,wσ|G1σX|u,wTr[σX|u,w]σX|u,w|GTr[σX|u,w|G]Tr\displaystyle\ \big{\|}P^{\sigma}_{X|u,w}-P^{\sigma^{|G}}_{X|u,w}\big{\|}_{1}\leq\left\|\frac{\sigma_{X|u,w}}{\mathrm{Tr}[\sigma_{X|u,w}]}-\frac{\sigma^{|G}_{X|u,w}}{\mathrm{Tr}[\sigma^{|G}_{X|u,w}]}\right\|_{\mathrm{Tr}}
\displaystyle\leq σX|u,wTr[σX|u,w]σX|u,w|GTr[σX|u,w]Tr+σX|u,w|GTr[σX|u,w]σX|u,w|GTr[σX|u,w|G]Tr\displaystyle\ \left\|{\frac{\sigma_{X|u,w}}{\mathrm{Tr}[\sigma_{X|u,w}]}-\frac{\sigma^{|G}_{X|u,w}}{\mathrm{Tr}[\sigma_{X|u,w}]}}\right\|_{\mathrm{Tr}}+\left\|{\frac{\sigma^{|G}_{X|u,w}}{\mathrm{Tr}[\sigma_{X|u,w}]}-\frac{\sigma^{|G}_{X|u,w}}{\mathrm{Tr}[\sigma^{|G}_{X|u,w}]}}\right\|_{\mathrm{Tr}}
=\displaystyle= σX|u,wTr[σX|u,w]σX|u,w|GTr[σX|u,w]Tr+|1Tr[σX|u,w]1Tr[σX|u,w|G]|Tr[σX|u,w|G]\displaystyle\ \left\|{\frac{\sigma_{X|u,w}}{\mathrm{Tr}[\sigma_{X|u,w}]}-\frac{\sigma^{|G}_{X|u,w}}{\mathrm{Tr}[\sigma_{X|u,w}]}}\right\|_{\mathrm{Tr}}+\left|\frac{1}{\mathrm{Tr}[\sigma_{X|u,w}]}-\frac{1}{\mathrm{Tr}[\sigma^{|G}_{X|u,w}]}\right|\cdot\mathrm{Tr}[\sigma^{|G}_{X|u,w}]
=\displaystyle= σX|u,wσX|u,w|GTrTr[σX|u,w]+|Tr[σX|u,w|G]Tr[σX|u,w]|Tr[σX|u,w]\displaystyle\frac{\big{\|}\sigma_{X|u,w}-\sigma^{|G}_{X|u,w}\big{\|}_{\mathrm{Tr}}}{\mathrm{Tr}[\sigma_{X|u,w}]}+\frac{\left|\mathrm{Tr}[\sigma^{|G}_{X|u,w}]-\mathrm{Tr}[\sigma_{X|u,w}]\right|}{\mathrm{Tr}[\sigma_{X|u,w}]}
\displaystyle\leq 2σX|u,wσX|u,w|GTrTr[σX|u,w]2σXV|wσXV|w|GTrTr[σX|u,w]482q2r122r,\displaystyle\ \frac{2\big{\|}\sigma_{X|u,w}-\sigma^{|G}_{X|u,w}\big{\|}_{\mathrm{Tr}}}{\mathrm{Tr}[\sigma_{X|u,w}]}\leq\frac{2\big{\|}\sigma_{XV|w}-\sigma^{|G}_{XV|w}\big{\|}_{\mathrm{Tr}}}{\mathrm{Tr}[\sigma_{X|u,w}]}\leq 48\cdot 2^{q-2r}\leq\frac{1}{2}\cdot 2^{-r},

where the last step is due to qr7q\leq r-7. Thus

|Ma,PX|v,wσ||Ma,PX|u,wσ|PX|u,wσPX|v,wσ1>122r.|\langle M_{a},P^{\sigma}_{X|v,w}\rangle|\geq|\langle M_{a},P^{\sigma}_{X|u,w}\rangle|-\big{\|}P^{\sigma}_{X|u,w}-P^{\sigma}_{X|v,w}\big{\|}_{1}>\frac{1}{2}\cdot 2^{-r}.

Similarly to the proof for Lemma 4.12, for each a𝒜a\in\mathcal{A}^{\prime} let

πa=TrX[(DiagMaUD1/2U)σXV|w(𝕀XUD1/2U)]\pi_{a}=\mathrm{Tr}_{X}[(\operatorname{Diag}M_{a}\otimes U^{\dagger}D^{-1/2}U)\cdot\sigma_{XV|w}\cdot(\mathbb{I}_{X}\otimes U^{\dagger}D^{-1/2}U)]

which is a Hermitian operator on 𝒱\mathcal{V}. For each |v𝒱|v\rangle\in\mathcal{V}, let |vUD1/2U|v|v^{\prime}\rangle\sim U^{\dagger}D^{1/2}U|v\rangle. Recall that σV|w=UDU\sigma_{V|w}=U^{\dagger}DU, and therefore

PX|v,wσ\displaystyle P^{\sigma}_{X|v,w} =diag(𝕀Xv|)σXV|w(𝕀X|v)v|σV|w|v\displaystyle=\frac{\operatorname{diag}\ (\mathbb{I}_{X}\otimes\langle v|)\sigma_{XV|w}(\mathbb{I}_{X}\otimes|v\rangle)}{\langle v|\sigma_{V|w}|v\rangle}
=diag(𝕀Xv|UD1/2U)σXV|w(𝕀XUD1/2U|v)v|UD1/2UσV|wUD1/2U|v\displaystyle=\frac{\operatorname{diag}\ (\mathbb{I}_{X}\otimes\langle v^{\prime}|U^{\dagger}D^{-1/2}U)\sigma_{XV|w}(\mathbb{I}_{X}\otimes U^{\dagger}D^{-1/2}U|v^{\prime}\rangle)}{\langle v^{\prime}|U^{\dagger}D^{-1/2}U\sigma_{V|w}U^{\dagger}D^{-1/2}U|v^{\prime}\rangle}
=diag(𝕀Xv|UD1/2U)σXV|w(𝕀XUD1/2U|v).\displaystyle=\operatorname{diag}\ (\mathbb{I}_{X}\otimes\langle v^{\prime}|U^{\dagger}D^{-1/2}U)\sigma_{XV|w}(\mathbb{I}_{X}\otimes U^{\dagger}D^{-1/2}U|v^{\prime}\rangle).

And that means

v|πa|v=Ma,diag(𝕀Xv|UD1/2U)σXV|w(𝕀XUD1/2U|v)=Ma,PX|v,wσ.\langle v^{\prime}|\pi_{a}|v^{\prime}\rangle=\left\langle M_{a},\operatorname{diag}\ (\mathbb{I}_{X}\otimes\langle v^{\prime}|U^{\dagger}D^{-1/2}U)\sigma_{XV|w}(\mathbb{I}_{X}\otimes U^{\dagger}D^{-1/2}U|v^{\prime}\rangle)\right\rangle=\langle M_{a},P^{\sigma}_{X|v,w}\rangle.

We showed above that there exists |v𝒱|v\rangle\in\mathcal{V}^{\prime}, and thus |v𝒱|v^{\prime}\rangle\in\mathcal{V}^{\prime} such that

|v|πa|v|=|Ma,PX|v,wσ|122r,|\langle v^{\prime}|\pi_{a}|v^{\prime}\rangle|=\left|\langle M_{a},P^{\sigma}_{X|v,w}\rangle\right|\geq\frac{1}{2}\cdot 2^{-r},

which means that for ΠπaΠ\Pi\pi_{a}\Pi, the restriction of πa\pi_{a} on 𝒱\mathcal{V}^{\prime}, we have ΠπaΠ2122r\|\Pi\pi_{a}\Pi\|_{2}\geq\frac{1}{2}\cdot 2^{-r}. Now consider a uniformly random unit vector |v|v^{\prime}\rangle in 𝒱\mathcal{V}^{\prime}, and by Lemma 3.1 we know that for some absolute constant cc,

Pr|v[|v|πa|v|2r]12(q+r+1r)/2ce2q12rce112.\Pr_{|v^{\prime}\rangle}\left[|\langle v^{\prime}|\pi_{a}|v^{\prime}\rangle|\geq 2^{-r^{\prime}}\right]\geq 1-2^{(q+r+1-r^{\prime})/2}c-e^{-2^{q}}\geq 1-2^{-r}c-e^{-1}\geq\frac{1}{2}.

Therefore, for the random vector |vUD1/2U|v|v\rangle\sim U^{\dagger}D^{-1/2}U|v^{\prime}\rangle where |v|v^{\prime}\rangle is uniform in 𝒱\mathcal{V}^{\prime}, we conclude that

Pr|v[|Ma,PX|v,wσ|2r]12.\Pr_{|v\rangle}\left[|\langle M_{a},P^{\sigma}_{X|v,w}\rangle|\geq 2^{-r^{\prime}}\right]\geq\frac{1}{2}.

On the other hand, as |v𝒱|v^{\prime}\rangle\in\mathcal{V}^{\prime}, it also holds that |v𝒱|v\rangle\in\mathcal{V}^{\prime}, therefore PX|v,wτ,P24r222n\langle P^{\tau}_{X|v,w},P\rangle\geq 2^{-4r}\cdot 2^{-2\ell}\cdot 2^{-n} is always true. Thus there exists a |v𝒱|v\rangle\in\mathcal{V} that simultaneously satisfies

PX|v,wτ,P24r222nand|Ma,PX|v,wσ|2r\langle P^{\tau}_{X|v,w},P\rangle\geq 2^{-4r}\cdot 2^{-2\ell}\cdot 2^{-n}\quad\textrm{and}\quad|\langle M_{a},P^{\sigma}_{X|v,w}\rangle|\geq 2^{-r^{\prime}}

for at least 1/21/2 of a𝒜a\in\mathcal{A}^{\prime}. Since

PX|v,wσ21PX|v,wτ,PPX|v,wτP2425+13r2n/2=22n/2,\big{\|}P^{\sigma}_{X|v,w}\big{\|}_{2}\leq\frac{1}{\langle P^{\tau}_{X|v,w},P\rangle}\cdot\big{\|}P^{\tau}_{X|v,w}\big{\|}_{\infty}\cdot\big{\|}P\big{\|}_{2}\leq 4\cdot 2^{{5\ell+13r}}\cdot 2^{-n/2}=2^{\ell^{\prime}}\cdot 2^{-n/2},

and MM is a (k,)(k^{\prime},\ell^{\prime})-extractor with error 2r2^{-r^{\prime}}, there are at most 2k2^{-k^{\prime}} fraction of a𝒜a\in\mathcal{A} such that |Ma,PX|v,wσ|2r|\langle M_{a},P^{\sigma}_{X|v,w}\rangle|\geq 2^{-r^{\prime}}, which means that

PraR𝒜[(w,a) is bad]PraR𝒜[a𝒜]22k=2k.\Pr_{a\in_{R}\mathcal{A}}[\textrm{$(w,a)$ is bad}]\leq\Pr_{a\in_{R}\mathcal{A}}[a\in\mathcal{A}^{\prime}]\leq 2\cdot 2^{-k^{\prime}}=2^{-k}.\qed
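For reference, the exponent arithmetic in the norm bound on P^σ_{X|v,w} above can be tracked as follows, using ⟨P^τ_{X|v,w},P⟩ ≥ 2^{−4r−2ℓ}·2^{−n}, the G∞ bound ‖P^τ_{X|v,w}‖_∞ ≤ 2^{2ℓ+9r}·2^{−n}, ‖P‖_2 ≤ 4·2^ℓ·2^{−n/2}, and the relation ℓ′ = 5ℓ+13r+2 recalled in Appendix A:

\frac{1}{\langle P^{\tau}_{X|v,w},P\rangle}\cdot\big\|P^{\tau}_{X|v,w}\big\|_{\infty}\cdot\big\|P\big\|_{2}\leq 2^{4r+2\ell+n}\cdot 2^{2\ell+9r}\cdot 2^{-n}\cdot 4\cdot 2^{\ell}\cdot 2^{-n/2}=2^{5\ell+13r+2}\cdot 2^{-n/2}=2^{\ell^{\prime}}\cdot 2^{-n/2}.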

5.2 Badness Levels

At stage tt, for each classical memory state w𝒲w\in\mathcal{W} we count how many times the path to it has been bad, which is a random variable depending on the previous random choices of a𝒜a\in\mathcal{A}. This is stored in another classical register BB, which we call badness level and takes values β{0,,T}\beta\in\{0,\ldots,T\}. It is initially set to be 0, that is, we let

τXVWB(0)=τXVW(0)|00|B.\tau^{(0)}_{XVWB}=\tau^{(0)}_{XVW}\otimes|0\rangle\langle 0|_{B}.

We ensure that the distribution of BB always only depends on WW and is independent of XX and VV conditioned on WW, using the following updating rules on the combined system τXVWB\tau_{XVWB} for each stage 0t<T0\leq t<T:

  • The truncation steps are executed independently of BB. Therefore, for each a𝒜a\in\mathcal{A} we let

    τXVWB(t,a)=w𝒲τXV|w(t,a)|ww|DiagPB|wτ(t).\tau^{(t,a)}_{XVWB}=\sum_{w\in\mathcal{W}}\tau^{(t,a)}_{XV|w}\otimes|w\rangle\langle w|\otimes\operatorname{Diag}P^{\tau^{(t)}}_{B|w}. (12)
  • The value of BB updates before the evolution step, where for each a𝒜a\in\mathcal{A} and b{1,1}b\in\{-1,1\} we let

    τXVWB(t,a,b)=(Diag𝟙a,b𝕀VUa)τXVWB(t,a)(𝕀XVUa).\tau^{(t,a,b)}_{XVWB}=(\operatorname{Diag}\mathbbm{1}_{a,b}\otimes\mathbb{I}_{V}\otimes U_{a})\tau^{(t,a)}_{XVWB}(\mathbb{I}_{XV}\otimes U_{a}^{\dagger}).

    Here UaU_{a} is a permutation operator, depending on τXVW(t,a)\tau^{(t,a)}_{XVW}, acting on 𝒲{0,,T}\mathcal{W}\otimes\{0,\ldots,T\} such that

    Ua|w|β={|w|(β+1)mod(T+1) if (w,a) is bad,|w|β otherwise.U_{a}|w\rangle|\beta\rangle=\left\{\begin{array}[]{ll}|w\rangle|(\beta+1)\mathop{\mathrm{mod}}(T+1)\rangle&\textrm{ if $(w,a)$ is bad,}\\ |w\rangle|\beta\rangle&\textrm{ otherwise.}\end{array}\right.
  • For the evolution step, we apply the channels Φt,a,b\Phi_{t,a,b} on the memories WW and VV to get

    τXVWB(t+1)=𝐄aR𝒜[(𝕀XΦt,a,1𝕀B)(τXVWB(t,a,1))+(𝕀XΦt,a,1𝕀B)(τXVWB(t,a,1))].\tau^{(t+1)}_{XVWB}=\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\left[(\mathbb{I}_{X}\otimes\Phi_{t,a,1}\otimes\mathbb{I}_{B})\big{(}\tau^{(t,a,1)}_{XVWB}\big{)}+(\mathbb{I}_{X}\otimes\Phi_{t,a,-1}\otimes\mathbb{I}_{B})\big{(}\tau^{(t,a,-1)}_{XVWB}\big{)}\right].

Notice that the evolution step might introduce dependencies between X,VX,V and BB. However, such dependencies are eliminated later due to how we handle the truncation steps (12), and thus do not affect our proof.

We can check that the combined partial system τXVWB(t)\tau^{(t)}_{XVWB} defined above is consistent with the partial system τXVW(t)\tau^{(t)}_{XVW} that we discussed in previous sections, in the sense that TrB[τXVWB(t)]=τXVW(t)\mathrm{Tr}_{B}[\tau^{(t)}_{XVWB}]=\tau^{(t)}_{XVW} always holds:

  • For the truncation step, it is straightforward to check that

    TrB[τXVWB(t,a)]=w𝒲τXV|w(t,a)|ww|=τXVW(t,a).\mathrm{Tr}_{B}[\tau^{(t,a)}_{XVWB}]=\sum_{w\in\mathcal{W}}\tau^{(t,a)}_{XV|w}\otimes|w\rangle\langle w|=\tau^{(t,a)}_{XVW}.
  • The permutation operator UaU_{a} acts on 𝒲\mathcal{W} as identity since

    TrB[Ua|w,βw,β|Ua]=|ww|.\mathrm{Tr}_{B}\left[U_{a}|w,\beta\rangle\langle w,\beta|U_{a}^{\dagger}\right]=|w\rangle\langle w|.

    Recalling Eq. 7 that τXVW(t,a,b)=(Diag𝟙a,b𝕀VW)τXVW(t,a)\tau^{(t,a,b)}_{XVW}=(\operatorname{Diag}\mathbbm{1}_{a,b}\otimes\mathbb{I}_{VW})\tau^{(t,a)}_{XVW}, we have TrB[τXVWB(t,a,b)]=τXVW(t,a,b)\mathrm{Tr}_{B}[\tau^{(t,a,b)}_{XVWB}]=\tau^{(t,a,b)}_{XVW}.

  • The evolution step can be checked directly from the formula without BB (Eq. 8):

    τXVW(t+1)=𝐄aR𝒜[(𝕀XΦt,a,1)(τXVW(t,a,1))+(𝕀XΦt,a,1)(τXVW(t,a,1))].\tau^{(t+1)}_{XVW}=\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\left[(\mathbb{I}_{X}\otimes\Phi_{t,a,1})\big{(}\tau^{(t,a,1)}_{XVW}\big{)}+(\mathbb{I}_{X}\otimes\Phi_{t,a,-1})\big{(}\tau^{(t,a,-1)}_{XVW}\big{)}\right].

So all previously proved properties about τXVW(t)\tau^{(t)}_{XVW} are preserved. In addition, we prove the following two properties about badness levels.

Lemma 5.3.

For every 0tT0\leq t\leq T, |v𝒱|v\rangle\in\mathcal{V} and w𝒲w\in\mathcal{W}, we have

PX|v,wτ(t),Pβ=0TPB|wτ(t)(β)2β2n(12r)3t.\langle P^{\tau^{(t)}}_{X|v,w},P\rangle\leq\sum_{\beta=0}^{T}P^{\tau^{(t)}}_{B|w}(\beta)\cdot 2^{\beta}\cdot 2^{-n}\cdot(1-2^{-r})^{-3t}.
Proof.

We prove it by induction on tt. For t=0t=0 the lemma is true as PX|v,wτ(t),P=2n\langle P^{\tau^{(t)}}_{X|v,w},P\rangle=2^{-n} and PB|wτ(t)(0)=1P^{\tau^{(t)}}_{B|w}(0)=1.

Suppose the lemma holds true for some t<Tt<T. By a similar argument as in Lemma 4.11 and applying Lemma 4.1 multiple times, we know that for every |v𝒱,w𝒲|v\rangle\in\mathcal{V},w\in\mathcal{W} and a𝒜a\in\mathcal{A}, there exist |v|v^{\prime}\rangle and |v′′𝒱|v^{\prime\prime}\rangle\in\mathcal{V} such that

PX|v,wτ(t,a),P=PX|v,wτ(t,),P(12r)1PX|v,wτ(t,),P=(12r)1PX|v′′,wτ(t),P,\langle P^{\tau^{(t,a)}}_{X|v,w},P\rangle=\langle P^{\tau^{(t,\circ)}}_{X|v^{\prime},w},P\rangle\leq(1-2^{-r})^{-1}\langle P^{\tau^{(t,\star)}}_{X|v^{\prime},w},P\rangle=(1-2^{-r})^{-1}\langle P^{\tau^{(t)}}_{X|v^{\prime\prime},w},P\rangle,

and therefore

PX|v,wτ(t,a),Pβ=0TPB|wτ(t)(β)2β2n(12r)3t1.\langle P^{\tau^{(t,a)}}_{X|v,w},P\rangle\leq\sum_{\beta=0}^{T}P^{\tau^{(t)}}_{B|w}(\beta)\cdot 2^{\beta}\cdot 2^{-n}\cdot(1-2^{-r})^{-3t-1}. (13)

Also, the truncation step by GaG_{a} implies that |Ma,PX|v,wτ(t,a)|2r|\langle M_{a},P^{\tau^{(t,a)}}_{X|v,w}\rangle|\leq 2^{-r}. That is, for both b{1,1}b\in\{-1,1\},

12r2𝟙a,bPX|v,wτ(t,a)11+2r.1-2^{-r}\leq 2\big{\|}\mathbbm{1}_{a,b}\cdot P^{\tau^{(t,a)}}_{X|v,w}\big{\|}_{1}\leq 1+2^{-r}.

Therefore we have, unconditionally

PX|v,wτ(t,a,b),P=𝟙a,bPX|v,wτ(t,a),P𝟙a,bPX|v,wτ(t,a)12(12r)1PX|v,wτ(t,a),P.\langle P^{\tau^{(t,a,b)}}_{X|v,w},P\rangle=\frac{\langle\mathbbm{1}_{a,b}\cdot P^{\tau^{(t,a)}}_{X|v,w},P\rangle}{\big{\|}\mathbbm{1}_{a,b}\cdot P^{\tau^{(t,a)}}_{X|v,w}\big{\|}_{1}}\leq 2(1-2^{-r})^{-1}\cdot\langle P^{\tau^{(t,a)}}_{X|v,w},P\rangle. (14)

When the inner product is evenly divided, i.e. |Ma,PX|v,wσ(t,a)|2r|\langle M_{a},P^{\sigma^{(t,a)}}_{X|v,w}\rangle|\leq 2^{-r}, we further have

𝟙a,bPX|v,wτ(t,a),P12(1+2r)PX|v,wτ(t,a),P12(12r)1PX|v,wτ(t,a),P,\langle\mathbbm{1}_{a,b}\cdot P^{\tau^{(t,a)}}_{X|v,w},P\rangle\leq\frac{1}{2}(1+2^{-r})\langle P^{\tau^{(t,a)}}_{X|v,w},P\rangle\leq\frac{1}{2}(1-2^{-r})^{-1}\langle P^{\tau^{(t,a)}}_{X|v,w},P\rangle,

which means that

PX|v,wτ(t,a,b),P=𝟙a,bPX|v,wτ(t,a),P𝟙a,bPX|v,wτ(t,a)1(12r)2PX|v,wτ(t,a),P.\langle P^{\tau^{(t,a,b)}}_{X|v,w},P\rangle=\frac{\langle\mathbbm{1}_{a,b}\cdot P^{\tau^{(t,a)}}_{X|v,w},P\rangle}{\big{\|}\mathbbm{1}_{a,b}\cdot P^{\tau^{(t,a)}}_{X|v,w}\big{\|}_{1}}\leq(1-2^{-r})^{-2}\cdot\langle P^{\tau^{(t,a)}}_{X|v,w},P\rangle. (15)

Now there are three cases to discuss:

  • If (w,a)(w,a) is bad, we have PB|wτ(t,a,b)(β)=PB|wτ(t)(β1)P^{\tau^{(t,a,b)}}_{B|w}(\beta)=P^{\tau^{(t)}}_{B|w}(\beta-1) for every β>0\beta>0. Notice that PB|wτ(t)(T)=0P^{\tau^{(t)}}_{B|w}(T)=0 as t<Tt<T, and thus Eq. 13 and Eq. 14 imply that

    PX|v,wτ(t,a,b),P\displaystyle\langle P^{\tau^{(t,a,b)}}_{X|v,w},P\rangle β=0T1PB|wτ(t)(β)2β+12n(12r)3t2\displaystyle\leq\sum_{\beta=0}^{T-1}P^{\tau^{(t)}}_{B|w}(\beta)\cdot 2^{\beta+1}\cdot 2^{-n}\cdot(1-2^{-r})^{-3t-2}
    β=0TPB|wτ(t,a,b)(β)2β2n(12r)3(t+1).\displaystyle\leq\sum_{\beta=0}^{T}P^{\tau^{(t,a,b)}}_{B|w}(\beta)\cdot 2^{\beta}\cdot 2^{-n}\cdot(1-2^{-r})^{-3(t+1)}.
  • If (w,a)(w,a) is not bad and |Ma,PX|v,wσ(t,a)|2r|\langle M_{a},P^{\sigma^{(t,a)}}_{X|v,w}\rangle|\leq 2^{-r}, we have PB|wτ(t,a,b)(β)=PB|wτ(t)(β)P^{\tau^{(t,a,b)}}_{B|w}(\beta)=P^{\tau^{(t)}}_{B|w}(\beta) for every β0\beta\geq 0. Then Eq. 13 and Eq. 15 imply that

    PX|v,wτ(t,a,b),P\displaystyle\langle P^{\tau^{(t,a,b)}}_{X|v,w},P\rangle β=0TPB|wτ(t)(β)2β2n(12r)3t3\displaystyle\leq\sum_{\beta=0}^{T}P^{\tau^{(t)}}_{B|w}(\beta)\cdot 2^{\beta}\cdot 2^{-n}\cdot(1-2^{-r})^{-3t-3}
    =β=0TPB|wτ(t,a,b)(β)2β2n(12r)3(t+1).\displaystyle=\sum_{\beta=0}^{T}P^{\tau^{(t,a,b)}}_{B|w}(\beta)\cdot 2^{\beta}\cdot 2^{-n}\cdot(1-2^{-r})^{-3(t+1)}.
  • If (w,a)(w,a) is not bad and |Ma,PX|v,wσ(t,a)|>2r|\langle M_{a},P^{\sigma^{(t,a)}}_{X|v,w}\rangle|>2^{-r}, by the definition of badness (11) we must have PX|v,wτ(t,a),P<122n\langle P^{\tau^{(t,a)}}_{X|v,w},P\rangle<\frac{1}{2}\cdot 2^{-n}. Thus by Eq. 14,

    PX|v,wτ(t,a,b),P<(12r)12nβ=0TPB|wτ(t,a,b)(β)2β2n(12r)3(t+1).\langle P^{\tau^{(t,a,b)}}_{X|v,w},P\rangle<(1-2^{-r})^{-1}\cdot 2^{-n}\leq\sum_{\beta=0}^{T}P^{\tau^{(t,a,b)}}_{B|w}(\beta)\cdot 2^{\beta}\cdot 2^{-n}\cdot(1-2^{-r})^{-3(t+1)}.

The last inequality follows from β=0TPB|wτ(t,a,b)(β)2β2n(12r)3(t+1)2n(12r)3(t+1)\sum_{\beta=0}^{T}P^{\tau^{(t,a,b)}}_{B|w}(\beta)\cdot 2^{\beta}\cdot 2^{-n}\cdot(1-2^{-r})^{-3(t+1)}\geq 2^{-n}(1-2^{-r})^{-3(t+1)}. Hence we obtain the same conclusion from all three cases.
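In the first case above (when (w,a) is bad), the reindexing behind the second displayed inequality can be spelled out as follows, using P^{τ(t,a,b)}_{B|w}(β) = P^{τ(t)}_{B|w}(β−1) for β > 0, P^{τ(t)}_{B|w}(T) = 0, and (1−2^{−r})^{−3t−2} ≤ (1−2^{−r})^{−3(t+1)}:

\sum_{\beta=0}^{T-1}P^{\tau^{(t)}}_{B|w}(\beta)\cdot 2^{\beta+1}=\sum_{\beta=1}^{T}P^{\tau^{(t)}}_{B|w}(\beta-1)\cdot 2^{\beta}=\sum_{\beta=1}^{T}P^{\tau^{(t,a,b)}}_{B|w}(\beta)\cdot 2^{\beta}\leq\sum_{\beta=0}^{T}P^{\tau^{(t,a,b)}}_{B|w}(\beta)\cdot 2^{\beta}.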

For the evolution step, since BB is classical we can view XX and BB as a whole and apply 3.3 to PXB|v,wτ(t+1)P^{\tau^{(t+1)}}_{XB|v,w}, which asserts that PXB|v,wτ(t+1)P^{\tau^{(t+1)}}_{XB|v,w} is a convex combination of PXB|v,wτ(t,a,b)P^{\tau^{(t,a,b)}}_{XB|v^{\prime},w^{\prime}} for some a,b,wa,b,w^{\prime} and |v|v^{\prime}\rangle. (It should be noted that in τ(t+1)\tau^{(t+1)}, XX and BB are not independent, while in τ(t,a,b)\tau^{(t,a,b)} they are independent conditioned on v,wv^{\prime},w^{\prime}. Nevertheless, independence of XX and BB in τ(t+1)\tau^{(t+1)} is not needed or used here: the final inequality follows by taking the corresponding convex combination of all the inequalities above.) Then by linearity we conclude that

PX|v,wτ(t+1),Pβ=0TPB|wτ(t+1)(β)2β2n(12r)3(t+1).\langle P^{\tau^{(t+1)}}_{X|v,w},P\rangle\leq\sum_{\beta=0}^{T}P^{\tau^{(t+1)}}_{B|w}(\beta)\cdot 2^{\beta}\cdot 2^{-n}\cdot(1-2^{-r})^{-3(t+1)}.\qed
Lemma 5.4.

For every 0βtT0\leq\beta\leq t\leq T we have

β|τB(t)|β2kβ(tβ).\langle\beta|\tau^{(t)}_{B}|\beta\rangle\leq 2^{-k\beta}\binom{t}{\beta}.
Proof.

We prove it by induction on tt. For t=0t=0 the lemma holds as τB(0)=|00|B\tau^{(0)}_{B}=|0\rangle\langle 0|_{B}. Also notice that the lemma is trivially true for every tt when β=0\beta=0.

Now suppose the lemma holds for some tt. By definition we have

τB(t+1)=𝐄aR𝒜[τB(t,a,1)+τB(t,a,1)]=𝐄aR𝒜TrW[UaτWB(t,a)Ua].\tau^{(t+1)}_{B}=\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}[\tau^{(t,a,1)}_{B}+\tau^{(t,a,-1)}_{B}]=\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\mathrm{Tr}_{W}[U_{a}\tau^{(t,a)}_{WB}U_{a}^{\dagger}].

Therefore

β|τB(t+1)|β=w𝒲𝐄aR𝒜[w,β|UaτWB(t,a)Ua|w,β].\langle\beta|\tau^{(t+1)}_{B}|\beta\rangle=\sum_{w\in\mathcal{W}}\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\left[\langle w,\beta|U_{a}\tau^{(t,a)}_{WB}U_{a}^{\dagger}|w,\beta\rangle\right].

By Lemma 5.2 we know that for every w𝒲w\in\mathcal{W}, the probability that (w,a)(w,a) is bad for aR𝒜a\in_{R}\mathcal{A} is at most 2k2^{-k}. In other words, for every β>0\beta>0,

Ua|w,β={|w,β, w.p. 12k|w,β1, w.p. 2kU_{a}^{\dagger}|w,\beta\rangle=\left\{\begin{array}[]{ll}|w,\beta\rangle,&\textrm{ w.p. }\geq 1-2^{-k}\\ |w,\beta-1\rangle,&\textrm{ w.p. }\leq 2^{-k}\end{array}\right.

where the probability is taken over the random choice of aa. It means that

β|τB(t+1)|β\displaystyle\langle\beta|\tau^{(t+1)}_{B}|\beta\rangle w𝒲w,β|τWB(t,a)|w,β+2kw𝒲w,β1|τWB(t,a)|w,β1\displaystyle\leq\sum_{w\in\mathcal{W}}\langle w,\beta|\tau^{(t,a)}_{WB}|w,\beta\rangle+2^{-k}\sum_{w\in\mathcal{W}}\langle w,\beta-1|\tau^{(t,a)}_{WB}|w,\beta-1\rangle
=β|τB(t,a)|β+2kβ1|τB(t,a)|β1.\displaystyle=\langle\beta|\tau^{(t,a)}_{B}|\beta\rangle+2^{-k}\cdot\langle\beta-1|\tau^{(t,a)}_{B}|\beta-1\rangle.

Notice that

τB(t,a)=w𝒲Tr[τXV|w(t,a)]DiagPB|wτ(t)w𝒲Tr[τXV|w(t)]DiagPB|wτ(t)=τB(t),\tau^{(t,a)}_{B}=\sum_{w\in\mathcal{W}}\mathrm{Tr}[\tau^{(t,a)}_{XV|w}]\cdot\operatorname{Diag}P^{\tau^{(t)}}_{B|w}\leq\sum_{w\in\mathcal{W}}\mathrm{Tr}[\tau^{(t)}_{XV|w}]\cdot\operatorname{Diag}P^{\tau^{(t)}}_{B|w}=\tau^{(t)}_{B},

and thus we conclude that

β|τB(t+1)|β\displaystyle\langle\beta|\tau^{(t+1)}_{B}|\beta\rangle β|τB(t)|β+2kβ1|τB(t)|β1\displaystyle\leq\langle\beta|\tau^{(t)}_{B}|\beta\rangle+2^{-k}\cdot\langle\beta-1|\tau^{(t)}_{B}|\beta-1\rangle
2kβ(tβ)+2k2k(β1)(tβ1)=2kβ(t+1β).\displaystyle\leq 2^{-k\beta}\binom{t}{\beta}+2^{-k}\cdot 2^{-k(\beta-1)}\binom{t}{\beta-1}=2^{-k\beta}\binom{t+1}{\beta}.\qed
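The last equality is Pascal's rule for binomial coefficients:

2^{-k\beta}\binom{t}{\beta}+2^{-k}\cdot 2^{-k(\beta-1)}\binom{t}{\beta-1}=2^{-k\beta}\left(\binom{t}{\beta}+\binom{t}{\beta-1}\right)=2^{-k\beta}\binom{t+1}{\beta}.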

With the lemmas above in hand, we can finally prove Lemma 4.5.

Proof for Lemma 4.5.

For the target distribution P=PX|v,wτ(t)P=P^{\tau^{(t)}}_{X|v,w}, which is an admissible choice since the assumption of Lemma 4.5 together with Lemma 5.1 gives 2^ℓ·2^{−n/2} < ‖P‖_2 ≤ 4·2^ℓ·2^{−n/2}, we have PX|v,wτ(t),P=P22>222n\langle P^{\tau^{(t)}}_{X|v,w},P\rangle=\big\|P\big\|_{2}^{2}>2^{2\ell}\cdot 2^{-n}, so by Lemma 5.3,

β=0TPB|wτ(t)(β)2β(12r)3t>22.\sum_{\beta=0}^{T}P^{\tau^{(t)}}_{B|w}(\beta)\cdot 2^{\beta}\cdot(1-2^{-r})^{-3t}>2^{2\ell}.

Since tT2r2t\leq T\leq 2^{r-2}, we have (12r)3t2(1-2^{-r})^{-3t}\leq 2, and thus

β=TPB|wτ(t)(β)2β>12(222β=012β)>2.\sum_{\beta=\ell}^{T}P^{\tau^{(t)}}_{B|w}(\beta)\cdot 2^{\beta}>\frac{1}{2}\left(2^{2\ell}-2\cdot\sum_{\beta=0}^{\ell-1}2^{\beta}\right)>2^{\ell}.
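Here the first inequality uses P^{τ(t)}_{B|w}(β) ≤ 1 to bound the discarded terms with β < ℓ, and the final bound is simple arithmetic, assuming ℓ ≥ 2, since Σ_{β=0}^{ℓ−1} 2^β = 2^ℓ − 1:

\frac{1}{2}\left(2^{2\ell}-2\cdot\sum_{\beta=0}^{\ell-1}2^{\beta}\right)=2^{2\ell-1}-2^{\ell}+1>2^{\ell}\quad\textrm{whenever }\ell\geq 2.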

On the other hand, for every β\beta\geq\ell, by Lemma 5.4,

Tr[τB|w(t)]PB|wτ(t)(β)β|τB(t)|β(2kt)β<2(kr)β,\mathrm{Tr}[\tau^{(t)}_{B|w}]\cdot P^{\tau^{(t)}}_{B|w}(\beta)\leq\langle\beta|\tau^{(t)}_{B}|\beta\rangle\leq(2^{-k}t)^{\beta}<2^{-(k-r)\beta},

and thus by Eq. 6,

Tr[τX|v,w(t)]Tr[τB|w(t)]<2β=T2(kr)β2β22(kr)22m24r.\mathrm{Tr}[\tau^{(t)}_{X|v,w}]\leq\mathrm{Tr}[\tau^{(t)}_{B|w}]<2^{-\ell}\sum_{\beta=\ell}^{T}2^{-(k-r)\beta}\cdot 2^{\beta}\leq 2\cdot 2^{-(k-r)\ell}\leq 2^{-2m}\cdot 2^{-4r}.\qed
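The geometric-series step above can be justified as follows, assuming k − r ≥ 2 (which holds here since k − r ≥ r by the parameter relations recalled in Appendix A and r is assumed sufficiently large):

2^{-\ell}\sum_{\beta=\ell}^{T}2^{-(k-r)\beta}\cdot 2^{\beta}=2^{-\ell}\sum_{\beta=\ell}^{T}2^{-(k-r-1)\beta}\leq 2^{-\ell}\cdot 2\cdot 2^{-(k-r-1)\ell}=2\cdot 2^{-(k-r)\ell}.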

Acknowledgement

We are grateful to Uma Girish for many important discussions and suggestions on the draft of this paper, and to the anonymous reviewers for their helpful comments.

References

  • [AAB+19] Frank Arute, Kunal Arya, Ryan Babbush, Dave Bacon, Joseph C Bardin, Rami Barends, Rupak Biswas, Sergio Boixo, Fernando GSL Brandao, David A Buell, et al. Quantum supremacy using a programmable superconducting processor. Nature, 574(7779):505–510, 2019.
  • [Aar18] Scott Aaronson. Shadow tomography of quantum states. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 325–338, 2018.
  • [ACQ22] Dorit Aharonov, Jordan Cotler, and Xiao-Liang Qi. Quantum algorithmic measurement. Nature communications, 13(1):1–9, 2022.
  • [ADR02] Yonatan Aumann, Yan Zong Ding, and Michael O Rabin. Everlasting security in the bounded storage model. IEEE Transactions on Information Theory, 48(6):1668–1680, 2002.
  • [AR99] Yonatan Aumann and Michael O Rabin. Information theoretically secure communication in the limited storage space model. In Annual International Cryptology Conference, pages 65–79. Springer, 1999.
  • [Aud07] Koenraad MR Audenaert. A sharp continuity estimate for the von Neumann entropy. Journal of Physics A: Mathematical and Theoretical, 40(28):8127, 2007.
  • [BCL20] Sebastien Bubeck, Sitan Chen, and Jerry Li. Entanglement is necessary for optimal quantum property testing. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 692–703. IEEE, 2020.
  • [BGY18] Paul Beame, Shayan Oveis Gharan, and Xin Yang. Time-space tradeoffs for learning finite functions from random evaluations, with applications to polynomials. In COLT, volume 75 of Proceedings of Machine Learning Research, pages 843–856. PMLR, 2018.
  • [BY21] Anne Broadbent and Peter Yuen. Device-independent oblivious transfer from the bounded-quantum-storage-model and computational assumptions. arXiv preprint arXiv:2111.08595, 2021.
  • [CCHL22] Sitan Chen, Jordan Cotler, Hsin-Yuan Huang, and Jerry Li. Exponential separations between learning with and without quantum memory. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 574–585. IEEE, 2022.
  • [CLO22] Sitan Chen, Jerry Li, and Ryan O’Donnell. Toward instance-optimal state certification with incoherent measurements. In Conference on Learning Theory, pages 2541–2596. PMLR, 2022.
  • [CM97] Christian Cachin and Ueli Maurer. Unconditional security against memory-bounded adversaries. In Annual International Cryptology Conference, pages 292–306. Springer, 1997.
  • [CW01] Anthony Carbery and James Wright. Distributional and LqL^{q} norm inequalities for polynomials over convex bodies in n\mathbb{R}^{n}. Mathematical Research Letters, 8(3):233–248, 2001.
  • [DFR+07] Ivan B Damgård, Serge Fehr, Renato Renner, Louis Salvail, and Christian Schaffner. A tight high-order entropic quantum uncertainty relation with applications. In Annual International Cryptology Conference, pages 360–378. Springer, 2007.
  • [DFSS07] Ivan B Damgård, Serge Fehr, Louis Salvail, and Christian Schaffner. Secure identification and QKD in the bounded-quantum-storage model. In Annual International Cryptology Conference, pages 342–359. Springer, 2007.
  • [DFSS08] Ivan B Damgård, Serge Fehr, Louis Salvail, and Christian Schaffner. Cryptography in the bounded-quantum-storage model. SIAM Journal on Computing, 37(6):1865–1890, 2008.
  • [DM02] Stefan Dziembowski and Ueli Maurer. Tight security proofs for the bounded-storage model. In Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, pages 341–350, 2002.
  • [DM04] Stefan Dziembowski and Ueli Maurer. On generating the initial key in the bounded-storage model. In International Conference on the Theory and Applications of Cryptographic Techniques, pages 126–137. Springer, 2004.
  • [DQW21] Yevgeniy Dodis, Willy Quach, and Daniel Wichs. Speak much, remember little: Cryptography in the bounded storage model, revisited. Cryptology ePrint Archive, 2021.
  • [DQW22] Yevgeniy Dodis, Willy Quach, and Daniel Wichs. Authentication in the bounded storage model. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pages 737–766. Springer, 2022.
  • [DR02] Yan Zong Ding and Michael O Rabin. Hyper-encryption and everlasting security. In Annual Symposium on Theoretical Aspects of Computer Science, pages 1–26. Springer, 2002.
  • [FvdG99] Christopher A Fuchs and Jeroen van de Graaf. Cryptographic distinguishability measures for quantum-mechanical states. IEEE Transactions on Information Theory, 45(4):1216–1227, 1999.
  • [GKK+08] Dmitry Gavinsky, Julia Kempe, Iordanis Kerenidis, Ran Raz, and Ronald de Wolf. Exponential separation for one-way quantum communication complexity, with applications to cryptography. SIAM J. Comput., 38(5):1695–1708, 2008.
  • [GKLR21] Sumegha Garg, Pravesh K. Kothari, Pengda Liu, and Ran Raz. Memory-sample lower bounds for learning parity with noise. In APPROX-RANDOM, volume 207 of LIPIcs, pages 60:1–60:19. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021.
  • [GKR20] Sumegha Garg, Pravesh K. Kothari, and Ran Raz. Time-space tradeoffs for distinguishing distributions and applications to security of goldreich’s PRG. In APPROX-RANDOM, volume 176 of LIPIcs, pages 21:1–21:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020.
  • [GRT18] Sumegha Garg, Ran Raz, and Avishay Tal. Extractor-based time-space lower bounds for learning. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, page 990–1002, New York, NY, USA, 2018. Association for Computing Machinery.
  • [GZ19] Jiaxin Guan and Mark Zhandry. Simple schemes in the bounded storage model. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pages 500–524. Springer, 2019.
  • [HHJ+16] Jeongwan Haah, Aram W Harrow, Zhengfeng Ji, Xiaodi Wu, and Nengkun Yu. Sample-optimal tomography of quantum states. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 913–925, 2016.
  • [HKP20] Hsin-Yuan Huang, Richard Kueng, and John Preskill. Predicting many properties of a quantum system from very few measurements. Nature Physics, 16(10):1050–1057, 2020.
  • [HN06] Danny Harnik and Moni Naor. On everlasting security in the hybrid bounded storage model. In International Colloquium on Automata, Languages, and Programming, pages 192–203. Springer, 2006.
  • [KK12] Roy Kasher and Julia Kempe. Two-source extractors secure against quantum adversaries. Theory of Computing, 8(1):461–486, 2012.
  • [KRT17] Gillat Kol, Ran Raz, and Avishay Tal. Time-space hardness of learning sparse parities. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1067–1080, 2017.
  • [LM00] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5), October 2000.
  • [Lu02] Chi-Jen Lu. Hyper-encryption against space-bounded adversaries from on-line strong extractors. In Annual International Cryptology Conference, pages 257–271. Springer, 2002.
  • [LV21] Jiahui Liu and Satyanarayana Vusirikala. Secure multiparty computation in the bounded storage model. In IMA International Conference on Cryptography and Coding, pages 289–325. Springer, 2021.
  • [Mau92] Ueli M Maurer. Conditionally-perfect secrecy and a provably-secure randomized cipher. Journal of Cryptology, 5(1):53–66, 1992.
  • [MM18] Dana Moshkovitz and Michal Moshkovitz. Entropy samplers and strong generic lower bounds for space bounded learning. In ITCS, volume 94 of LIPIcs, pages 28:1–28:20. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2018.
  • [MSTS04] Tal Moran, Ronen Shaltiel, and Amnon Ta-Shma. Non-interactive timestamping in the bounded storage model. In Annual International Cryptology Conference, pages 460–476. Springer, 2004.
  • [OP04] Masanori Ohya and Dénes Petz. Quantum entropy and its use. Springer Science & Business Media, 2004.
  • [PMLA13] Stefano Pironio, Ll Masanes, Anthony Leverrier, and Antonio Acín. Security of device-independent quantum key distribution in the bounded-quantum-storage model. Physical Review X, 3(3):031007, 2013.
  • [Raz17] Ran Raz. A time-space lower bound for a large class of learning problems. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 732–742, 2017.
  • [Raz18] Ran Raz. Fast learning requires good memory: A time-space lower bound for parity learning. J. ACM, 66(1), December 2018.
  • [RS08] Alvin C Rencher and G Bruce Schaalje. Linear models in statistics. John Wiley & Sons, 2008.
  • [Sch07] Christian Schaffner. Cryptography in the bounded-quantum-storage model. arXiv preprint arXiv:0709.0289, 2007.
  • [Sha14] Ohad Shamir. Fundamental limits of online and distributed algorithms for statistical learning and estimation. Advances in Neural Information Processing Systems, 27, 2014.
  • [SSV19] Vatsal Sharan, Aaron Sidford, and Gregory Valiant. Memory-sample tradeoffs for linear regression with small error. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 890–901, 2019.
  • [SVW16] Jacob Steinhardt, Gregory Valiant, and Stefan Wager. Memory, communication, and statistical queries. In Conference on Learning Theory, pages 1490–1516. PMLR, 2016.
  • [Uhl76] Armin Uhlmann. The “transition probability” in the state space of a *-algebra. Reports on Mathematical Physics, 9(2):273–279, 1976.
  • [Wri16] John Wright. How to learn a quantum state. PhD thesis, Carnegie Mellon University, 2016.
  • [WW08] Stephanie Wehner and Jürg Wullschleger. Composable security in the bounded-quantum-storage model. In International Colloquium on Automata, Languages, and Programming, pages 604–615. Springer, 2008.

Appendix A Bounding Parameters for Theorem 2

Proof.

For Equation 4, since rr/4r\leq r^{\prime}/4 by the definition of rr and qr7q\leq r-7, we have:

q+r+1rq+r+14r2r62r.\displaystyle q+r+1-r^{\prime}\leq q+r+1-4r\leq-2r-6\leq-2r.

For Equation 5, we use the fact that =15(13r2)\ell=\frac{1}{5}(\ell^{\prime}-13r-2), therefore

2+9rn=\displaystyle 2\ell+9r-n= 25(13r2)+9rn\displaystyle\frac{2}{5}(\ell^{\prime}-13r-2)+9r-n
=\displaystyle= 25+195r45n.\displaystyle\frac{2}{5}\ell^{\prime}+\frac{19}{5}r-\frac{4}{5}-n.

Thus 2\ell+9r-n\leq-r whenever r\leq\frac{1}{8}\ell^{\prime}+\frac{1}{6} (using the fact that \ell^{\prime}\leq n), and the latter holds since r\leq\frac{1}{26}\ell^{\prime}+\frac{1}{6} by definition.
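Explicitly, the inequality \frac{2}{5}\ell^{\prime}+\frac{19}{5}r-\frac{4}{5}-n\leq-r rearranges to

\frac{24}{5}r\leq n-\frac{2}{5}\ell^{\prime}+\frac{4}{5},

and since \ell^{\prime}\leq n it suffices that \frac{24}{5}r\leq\frac{3}{5}\ell^{\prime}+\frac{4}{5}, that is, r\leq\frac{1}{8}\ell^{\prime}+\frac{1}{6}.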

For Equation 6, as r126+16r\leq\frac{1}{26}\ell^{\prime}+\frac{1}{6}, r12(k1)r\leq\frac{1}{2}(k^{\prime}-1) and k=k1k=k^{\prime}-1,

kr\displaystyle k-r k(k1)/2=(k1)/2,\displaystyle\geq k-(k^{\prime}-1)/2=(k^{\prime}-1)/2,
\displaystyle\ell =15(13r2)15(21362)>1101.\displaystyle=\frac{1}{5}(\ell^{\prime}-13r-2)\geq\frac{1}{5}\left(\ell^{\prime}-\frac{\ell^{\prime}}{2}-\frac{13}{6}-2\right)>\frac{1}{10}\ell^{\prime}-1.

When \ell^{\prime} is sufficiently large (concretely, \ell^{\prime}>660), we have \frac{1}{10}\ell^{\prime}-1>\frac{1}{11}\ell^{\prime}+5. Therefore, using the fact that m\leq(k^{\prime}-1)\ell^{\prime}/44:

(kr)\displaystyle(k-r)\ell 12(k1)(111+5)\displaystyle\geq\frac{1}{2}(k^{\prime}-1)\left(\frac{1}{11}\ell^{\prime}+5\right)
=122(k1)+52(k1)\displaystyle=\frac{1}{22}(k^{\prime}-1)\ell^{\prime}+\frac{5}{2}(k^{\prime}-1)
2m+5r\displaystyle\geq 2m+5r
2m+4r+1.\displaystyle\geq 2m+4r+1.\qed

Appendix B Proof for Proposition 4.2

We first state a variant of the Fuchs-van de Graaf inequality [FvdG99] on fidelity, defined for two partial density operators \rho and \sigma as

F(\rho,\sigma)=\mathrm{Tr}\left[\sqrt{\sqrt{\rho}\sigma\sqrt{\rho}}\right]^{2}.
Lemma B.1.

Let \rho,\sigma be two positive semi-definite operators. Assume \mathrm{Tr}[\rho]\geq\mathrm{Tr}[\sigma]. Then

12ρσTr14(Tr[ρ]+Tr[σ])2F(ρ,σ)Tr[ρ]2F(ρ,σ).\displaystyle\frac{1}{2}\big{\|}\rho-\sigma\big{\|}_{\mathrm{Tr}}\leq\sqrt{\frac{1}{4}(\mathrm{Tr}[\rho]+\mathrm{Tr}[\sigma])^{2}-F(\rho,\sigma)}\leq\sqrt{\mathrm{Tr}[\rho]^{2}-F(\rho,\sigma)}.

Notice that when Tr[ρ]=Tr[σ]=1\mathrm{Tr}[\rho]=\mathrm{Tr}[\sigma]=1, the above inequality is the original Fuchs-van de Graaf inequality.
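In that case, since \frac{1}{4}(\mathrm{Tr}[\rho]+\mathrm{Tr}[\sigma])^{2}=1, the bound specializes to

\frac{1}{2}\big\|\rho-\sigma\big\|_{\mathrm{Tr}}\leq\sqrt{1-F(\rho,\sigma)}.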

Proof.

Let uu and vv be purifications of ρ\rho and σ\sigma, that is, u,vAu,v\in\mathcal{H}\otimes\mathcal{H}_{A} with ρ=TrA[uu]\rho=\mathrm{Tr}_{A}[uu^{\dagger}] and σ=TrA[vv]\sigma=\mathrm{Tr}_{A}[vv^{\dagger}], where \mathcal{H} is the ambient space of ρ\rho and σ\sigma, and A\mathcal{H}_{A} is some finite-dimensional Hilbert space.

Let U be a unitary that diagonalizes uu^{\dagger}-vv^{\dagger}; that is, there is a diagonal matrix \Lambda\in\mathbb{C}^{d\times d}, where d is the dimension of \mathcal{H}\otimes\mathcal{H}_{A}, such that uu^{\dagger}-vv^{\dagger}=U\Lambda U^{\dagger}. Let p,q\in\mathbb{R}^{d}_{\geq 0} be the diagonals of U^{\dagger}uu^{\dagger}U and U^{\dagger}vv^{\dagger}U respectively. We have

uu=Tr[UuuU]\displaystyle u^{\dagger}u=\mathrm{Tr}[U^{\dagger}uu^{\dagger}U] =x[d]p(x),\displaystyle=\sum_{x\in[d]}p(x),
vv=Tr[UvvU]\displaystyle v^{\dagger}v=\mathrm{Tr}[U^{\dagger}vv^{\dagger}U] =x[d]q(x),\displaystyle=\sum_{x\in[d]}q(x),
uuvvTr=ΛTr\displaystyle\big{\|}uu^{\dagger}-vv^{\dagger}\big{\|}_{\mathrm{Tr}}=\big{\|}\Lambda\big{\|}_{\mathrm{Tr}} =x[d]|p(x)q(x)|,\displaystyle=\sum_{x\in[d]}|p(x)-q(x)|,
|u,v|=|Uu,Uv|\displaystyle|\langle u,v\rangle|=|\langle U^{\dagger}u,U^{\dagger}v\rangle| x[d]p(x)q(x).\displaystyle\leq\sum_{x\in[d]}\sqrt{p(x)q(x)}.
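Here the third and fourth relations hold because U^{\dagger}uu^{\dagger}U=(U^{\dagger}u)(U^{\dagger}u)^{\dagger} has diagonal entries p(x)=|(U^{\dagger}u)(x)|^{2} (and similarly for q), while U^{\dagger}(uu^{\dagger}-vv^{\dagger})U=\Lambda is diagonal with entries p(x)-q(x); in particular,

|\langle U^{\dagger}u,U^{\dagger}v\rangle|=\left|\sum_{x\in[d]}\overline{(U^{\dagger}u)(x)}\cdot(U^{\dagger}v)(x)\right|\leq\sum_{x\in[d]}\sqrt{p(x)q(x)}.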

Therefore, by Cauchy-Schwarz inequality,

uuvvTr2\displaystyle\big{\|}uu^{\dagger}-vv^{\dagger}\big{\|}^{2}_{\mathrm{Tr}} =(x[d]|p(x)q(x)|)2\displaystyle=\left(\sum_{x\in[d]}|p(x)-q(x)|\right)^{2}
=(x[d]|p(x)q(x)||p(x)+q(x)|)2\displaystyle=\left(\sum_{x\in[d]}\left|\sqrt{p(x)}-\sqrt{q(x)}\right|\cdot\left|\sqrt{p(x)}+\sqrt{q(x)}\right|\right)^{2}
(x[d]|p(x)q(x)|2)(x[d]|p(x)+q(x)|2)\displaystyle\leq\left(\sum_{x\in[d]}\left|\sqrt{p(x)}-\sqrt{q(x)}\right|^{2}\right)\left(\sum_{x\in[d]}\left|\sqrt{p(x)}+\sqrt{q(x)}\right|^{2}\right)
=(x[d]p(x)+x[d]q(x))24(x[d]p(x)q(x))2\displaystyle=\left(\sum_{x\in[d]}p(x)+\sum_{x\in[d]}q(x)\right)^{2}-4\left(\sum_{x\in[d]}\sqrt{p(x)q(x)}\right)^{2}
(uu+vv)24|u,v|2.\displaystyle\leq\left(u^{\dagger}u+v^{\dagger}v\right)^{2}-4|\langle u,v\rangle|^{2}.

Notice that ρσTruuvvTr\big{\|}\rho-\sigma\big{\|}_{\mathrm{Tr}}\leq\big{\|}uu^{\dagger}-vv^{\dagger}\big{\|}_{\mathrm{Tr}}, Tr[ρ]=uu\mathrm{Tr}[\rho]=u^{\dagger}u and Tr[σ]=vv\mathrm{Tr}[\sigma]=v^{\dagger}v. By Uhlmann’s theorem [Uhl76], we can also choose uu and vv such that F(ρ,σ)=|u,v|2F(\rho,\sigma)=|\langle u,v\rangle|^{2}. Plugging them into the above inequality concludes the proof. ∎

Now we are ready to prove Proposition 4.2.

Proof for Proposition 4.2.

By Lemma B.1 (our variant of the Fuchs-van de Graaf inequality), it suffices to prove the following bound on fidelity:

F(ρ,ΠρΠ)Tr[Πρ]2.F(\rho,\Pi\rho\Pi)\geq\mathrm{Tr}[\Pi\rho]^{2}.

Let u be a purification of \rho, that is, \rho=\mathrm{Tr}_{A}[uu^{\dagger}] for some Hilbert space \mathcal{H}_{A}. Then \big(\Pi\otimes\mathbb{I}_{A}\big)u is a purification of \Pi\rho\Pi. By Uhlmann’s theorem we have

F(ρ,ΠρΠ)|u(Π𝕀A)u|2=Tr[(Π𝕀A)uu]2=Tr[ΠTrA[uu]]2=Tr[Πρ]2.F(\rho,\Pi\rho\Pi)\geq\left|u^{\dagger}\big{(}\Pi\otimes\mathbb{I}_{A}\big{)}u\right|^{2}=\mathrm{Tr}\left[\big{(}\Pi\otimes\mathbb{I}_{A}\big{)}uu^{\dagger}\right]^{2}=\mathrm{Tr}\left[\Pi\cdot\mathrm{Tr}_{A}[uu^{\dagger}]\right]^{2}=\mathrm{Tr}[\Pi\rho]^{2}.\qed
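In particular, since \mathrm{Tr}[\Pi\rho\Pi]=\mathrm{Tr}[\Pi\rho]\leq\mathrm{Tr}[\rho], applying Lemma B.1 with \sigma=\Pi\rho\Pi together with this fidelity bound yields the corresponding trace-distance bound

\frac{1}{2}\big\|\rho-\Pi\rho\Pi\big\|_{\mathrm{Tr}}\leq\sqrt{\mathrm{Tr}[\rho]^{2}-\mathrm{Tr}[\Pi\rho]^{2}}.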

Appendix C Linear Quantum Memory Lower Bound

In this appendix, we prove Theorem 3, which gives a simpler proof of a linear quantum-memory lower bound (without classical memory). Theorem 3 is qualitatively weaker than our main results in most cases, since it only applies to programs with a linear-size quantum memory and no (possibly quadratic-size) classical memory. Nevertheless, it is technically incomparable to the main results: it is stated in terms of quantum extractors, and the bound on the quantum-memory size depends on a different parameter of the extractor. Additionally, the proof of Theorem 3 is significantly simpler than the proof of our main theorem.

We first define the quantum extractor property that we need, which is a simplified version of the ones considered in [KK12]. Given a matrix M:𝒜×𝒳{1,1}M:\mathcal{A}\times\mathcal{X}\rightarrow\{-1,1\}, consider two independent sources AA and XX uniformly distributed over 𝒜\mathcal{A} and 𝒳\mathcal{X} respectively. Suppose there is some quantum register VV whose state depends on AA and XX, and they together form a classical-quantum system

ρAXV=a𝒜,x𝒳ρV|a,x,\rho_{AXV}=\bigoplus_{a\in\mathcal{A},x\in\mathcal{X}}\rho_{V|a,x},

where ρV|a,x\rho_{V|a,x} is the state of VV when A=aA=a and X=xX=x. For any function ff on A×XA\times X, we say that VV depends only on f(A,X)f(A,X) if for any a,a𝒜a,a^{\prime}\in\mathcal{A} and x,x𝒳x,x^{\prime}\in\mathcal{X}, whenever f(a,x)=f(a,x)f(a,x)=f(a^{\prime},x^{\prime}) we have ρV|a,x=ρV|a,x\rho_{V|a,x}=\rho_{V|a^{\prime},x^{\prime}}. In particular, VV depending only on AA is equivalent to VV being independent of XX, or ρXV=ρXρV\rho_{XV}=\rho_{X}\otimes\rho_{V}.

We say that MM is an XX-strong (q,r)(q,r)-quantum extractor, if for every classical-quantum system ρAXV\rho_{AXV}, as above, with the qq-qubit quantum subsystem VV that depends only on AA, it holds that

ρM(A,X)XVUρXρVTr2r.\left\|\rho_{M(A,X)XV}-U\otimes\rho_{X}\otimes\rho_{V}\right\|_{\mathrm{Tr}}\leq 2^{-r}.

Here U=(1/2001/2)U=\begin{pmatrix}1/2&0\\ 0&1/2\end{pmatrix} is the uniform operator over one bit, and ρM(A,X)XV\rho_{M(A,X)XV} is the classical-quantum system constructed by adding a new classical register storing the value of M(A,X)M(A,X) and tracing out AA. In other words,

ρM(A,X)XV=b{1,1},x𝒳a𝒜M(a,x)=bρV|a,x.\rho_{M(A,X)XV}=\bigoplus_{b\in\{-1,1\},x\in\mathcal{X}}\sum_{\begin{subarray}{c}a\in\mathcal{A}\\ M(a,x)=b\end{subarray}}\rho_{V|a,x}.

Notice that if we choose VV to be trivial, the above inequality immediately implies that |𝐄[M(A,X)]|2r\left|\mathop{\mathbf{E}\mbox{}}\limits[M(A,X)]\right|\leq 2^{-r}.
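Indeed, since tracing out registers can only decrease the trace distance, taking V to be trivial in the extractor condition gives \big\|\rho_{M(A,X)}-U\big\|_{\mathrm{Tr}}\leq 2^{-r}, and for the single classical bit M(A,X),

\big\|\rho_{M(A,X)}-U\big\|_{\mathrm{Tr}}=\left|\Pr[M(A,X)=1]-\tfrac{1}{2}\right|+\left|\Pr[M(A,X)=-1]-\tfrac{1}{2}\right|=\left|\mathop{\mathbf{E}}[M(A,X)]\right|.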

As an example, the results in [KK12] imply that the inner product function on nn bits, where 𝒜=𝒳=𝔽2n\mathcal{A}=\mathcal{X}=\mathbb{F}_{2}^{n} and

M(a,x)=(1)ax,M(a,x)=(-1)^{a\cdot x},

is an XX-strong (k,nk)(k,n-k)-quantum extractor for every 2kn2\leq k\leq n.

In this section we prove the following theorem:

Theorem 3.

Let \mathcal{X},\mathcal{A} be two finite sets with n=\log_{2}|\mathcal{X}|. Let M:\mathcal{A}\times\mathcal{X}\rightarrow\{-1,1\} be a matrix which is an X-strong (q,r)-quantum extractor. Let \rho be a branching program for the learning problem corresponding to M, described by classical-quantum systems \rho^{(t)}_{XV}, with q/2-qubit quantum memory V and length T, and without classical memory. Then the success probability of \rho is at most

2n+8Tn+q2r/4.2^{-n}+8T\sqrt{n+q}\cdot 2^{-r/4}.
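As an illustration, combining Theorem 3 with the inner product example above (an X-strong (k,n-k)-quantum extractor) and taking k=n/2, say for even n\geq 4, shows that any branching program for the corresponding learning problem (parity learning on n bits) with n/4 qubits of quantum memory and no classical memory has success probability at most

2^{-n}+8T\sqrt{3n/2}\cdot 2^{-n/8},

and therefore needs T=2^{\Omega(n)} samples to succeed with any constant probability.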

To prove the theorem, we first need to define the following measure of dependency:

Definition.

Let ρXV\rho_{XV} be a classical-quantum system over classical XX and quantum VV. The dependency of VV on XX in ρXV\rho_{XV} is defined as

ξρ(X;V)=minτVρXVρXτVTr\xi^{\rho}(X;V)=\min_{\tau_{V}}\big{\|}\rho_{XV}-\rho_{X}\otimes\tau_{V}\big{\|}_{\mathrm{Tr}}

where τV\tau_{V} is taken over all density operators on VV. Notice that in this definition taking τV=ρV\tau_{V}=\rho_{V} is almost optimal as we have

ρXVρXρVTrρXVρXτVTr+ρVτVTr2ρXVρXτVTr.\big{\|}\rho_{XV}-\rho_{X}\otimes\rho_{V}\big{\|}_{\mathrm{Tr}}\leq\big{\|}\rho_{XV}-\rho_{X}\otimes\tau_{V}\big{\|}_{\mathrm{Tr}}+\big{\|}\rho_{V}-\tau_{V}\big{\|}_{\mathrm{Tr}}\leq 2\big{\|}\rho_{XV}-\rho_{X}\otimes\tau_{V}\big{\|}_{\mathrm{Tr}}. (16)

We also consider the quantum mutual information between XX and VV in ρXV\rho_{XV}, defined as

𝐈ρ(X;V)=𝐒(ρX)+𝐒(ρV)𝐒(ρXV)=𝐒(ρXVρXρV),\mathbf{I}_{\rho}\left(X;V\right)=\mathbf{S}(\rho_{X})+\mathbf{S}(\rho_{V})-\mathbf{S}(\rho_{XV})=\mathbf{S}\left(\rho_{XV}\mathrel{\|}\rho_{X}\otimes\rho_{V}\right),

where 𝐒()\mathbf{S}(\cdot) denotes the von Neumann entropy, and 𝐒()\mathbf{S}(\cdot\mathrel{\|}\cdot) denotes quantum relative entropy. When VV consists of qq qubits, we have the following relationship between our dependency measure and quantum mutual information:

Lemma C.1.

12ξρ(X;V)2𝐈ρ(X;V)qξρ(X;V)+2ξρ(X;V)\displaystyle\frac{1}{2}\xi^{\rho}(X;V)^{2}\leq\mathbf{I}_{\rho}\left(X;V\right)\leq q\cdot\xi^{\rho}(X;V)+2\sqrt{\xi^{\rho}(X;V)}.

Proof.

On one hand, using the inequality on quantum relative entropy and trace distance (see e.g. [OP04, Theorem 1.15]), we have

𝐈ρ(X;V)=𝐒(ρXVρXρV)12ρXVρXρVTr212ξρ(X;V)2.\mathbf{I}_{\rho}\left(X;V\right)=\mathbf{S}\left(\rho_{XV}\mathrel{\|}\rho_{X}\otimes\rho_{V}\right)\geq\frac{1}{2}\big{\|}\rho_{XV}-\rho_{X}\otimes\rho_{V}\big{\|}^{2}_{\mathrm{Tr}}\geq\frac{1}{2}\xi^{\rho}(X;V)^{2}.

On the other hand, the Fannes-Audenaert inequality [Aud07] tells us that the difference between the von Neumann entropies of any two states \rho and \tau on V is bounded by

|𝐒(ρ)𝐒(τ)|q12ρτTr+h(12ρτTr)|\mathbf{S}(\rho)-\mathbf{S}(\tau)|\leq q\cdot\frac{1}{2}\big{\|}\rho-\tau\big{\|}_{\mathrm{Tr}}+h\left(\frac{1}{2}\big{\|}\rho-\tau\big{\|}_{\mathrm{Tr}}\right)

where h(ϵ)=ϵlog2ϵ(1ϵ)log2(1ϵ)h(\epsilon)=-\epsilon\log_{2}\epsilon-(1-\epsilon)\log_{2}(1-\epsilon) is the binary entropy function. Since the state of VV conditioned on X=xX=x is ρV|x/Pr[X=x]=2nρV|x\rho_{V|x}/\Pr[X=x]=2^{n}\rho_{V|x}, we have

𝐈ρ(X;V)\displaystyle\mathbf{I}_{\rho}\left(X;V\right) =𝐄xX[𝐒(ρV)𝐒(2nρV|x)]\displaystyle=\mathop{\mathbf{E}\mbox{}}\limits_{x\sim X}\left[\mathbf{S}(\rho_{V})-\mathbf{S}(2^{n}\rho_{V|x})\right]
12q𝐄xXρV2nρV|xTr+𝐄xXh(12ρV2nρV|xTr)\displaystyle\leq\frac{1}{2}q\cdot\mathop{\mathbf{E}\mbox{}}\limits_{x\sim X}\big{\|}\rho_{V}-2^{n}\rho_{V|x}\big{\|}_{\mathrm{Tr}}+\mathop{\mathbf{E}\mbox{}}\limits_{x\sim X}h\left(\frac{1}{2}\big{\|}\rho_{V}-2^{n}\rho_{V|x}\big{\|}_{\mathrm{Tr}}\right)
12qρXVρXρVTr+h(12ρXVρXρVTr)\displaystyle\leq\frac{1}{2}q\cdot\big{\|}\rho_{XV}-\rho_{X}\otimes\rho_{V}\big{\|}_{\mathrm{Tr}}+h\left(\frac{1}{2}\big{\|}\rho_{XV}-\rho_{X}\otimes\rho_{V}\big{\|}_{\mathrm{Tr}}\right)
12qρXVρXρVTr+2ρXVρXρVTr,\displaystyle\leq\frac{1}{2}q\cdot\big{\|}\rho_{XV}-\rho_{X}\otimes\rho_{V}\big{\|}_{\mathrm{Tr}}+\sqrt{2\big{\|}\rho_{XV}-\rho_{X}\otimes\rho_{V}\big{\|}_{\mathrm{Tr}}},

as h is concave and h(\epsilon)\leq 2\sqrt{\epsilon}. Now let \tau_{V} be the optimal density operator in the definition of \xi^{\rho}(X;V). Plugging in Eq. 16, which gives \big\|\rho_{XV}-\rho_{X}\otimes\rho_{V}\big\|_{\mathrm{Tr}}\leq 2\xi^{\rho}(X;V), we conclude that

𝐈ρ(X;V)qξρ(X;V)+2ξρ(X;V).\mathbf{I}_{\rho}\left(X;V\right)\leq q\cdot\xi^{\rho}(X;V)+2\sqrt{\xi^{\rho}(X;V)}.\qed
Lemma C.2.

For every classical-quantum system ρAXV\rho_{AXV} with the qq-qubit quantum subsystem VV that depends only on AA, we have

\mathbf{I}_{\rho}\left(X;M(A,X),V\right)\leq 5(n+q)\cdot 2^{-r/2}.
Proof.

Since \mathbf{I}_{\rho}\left(X;V\right)=0, we have \mathbf{I}_{\rho}\left(X;M(A,X),V\right)=\mathbf{I}_{\rho}\left(X;M(A,X)\mid V\right)\leq\mathbf{I}_{\rho}\left(M(A,X);X,V\right), so it suffices to bound the latter. We first notice that since M is an X-strong (q,r)-quantum extractor,

ξρ(M(A,X);X,V)\displaystyle\xi^{\rho}(M(A,X);X,V) ρM(A,X)XVρM(A,X)ρXρVTr\displaystyle\leq\left\|\rho_{M(A,X)XV}-\rho_{M(A,X)}\otimes\rho_{X}\otimes\rho_{V}\right\|_{\mathrm{Tr}}
ρM(A,X)XVUρXρVTr+|𝐄[M(A,X)]|\displaystyle\leq\left\|\rho_{M(A,X)XV}-U\otimes\rho_{X}\otimes\rho_{V}\right\|_{\mathrm{Tr}}+\left|\mathop{\mathbf{E}\mbox{}}\limits[M(A,X)]\right|
22r.\displaystyle\leq 2\cdot 2^{-r}.

As the total dimension of XX and VV is 2n+q2^{n+q}, by Lemma C.1 we have

𝐈ρ(X;M(A,X),V)\displaystyle\mathbf{I}_{\rho}\left(X;M(A,X),V\right) 𝐈ρ(M(A,X);X,V)\displaystyle\leq\mathbf{I}_{\rho}\left(M(A,X);X,V\right)
(n+q)ξρ(M(A,X);X,V)+2ξρ(M(A,X);X,V)\displaystyle\leq(n+q)\cdot\xi^{\rho}(M(A,X);X,V)+2\sqrt{\xi^{\rho}(M(A,X);X,V)}
\displaystyle\leq 2(n+q)\cdot 2^{-r}+2\sqrt{2\cdot 2^{-r}}
\displaystyle\leq 5(n+q)\cdot 2^{-r/2}.\qed
Lemma C.3.

For every classical-quantum system ρAXV\rho_{AXV} with q/2q/2-qubit quantum subsystem VV that depends only on AA and M(A,X)M(A,X), we have

ξρ(X;V)4n+q2r/4.\xi^{\rho}(X;V)\leq 4\sqrt{n+q}\cdot 2^{-r/4}.
Proof.

Let W=\rho_{a,-1}\otimes\rho_{a,1}, where \rho_{a,b} is the density matrix of V when A=a and M(A,X)=b. Then W is a q-qubit quantum system that depends only on A. Since V can be recovered from M(A,X) and W (by keeping the half of W indexed by M(A,X)), we have

\xi^{\rho}(X;V)^{2}\leq 2\mathbf{I}_{\rho}\left(X;V\right)\leq 2\mathbf{I}_{\rho}\left(X;M(A,X),W\right)\leq 10(n+q)\cdot 2^{-r/2}.\qed

We are now ready to prove Theorem 3. Let Φt,a,b\Phi_{t,a,b} be the quantum channel applied on VV at stage tt with sample (a,b)(a,b), and recall that the evolution of the system ρXV(t)\rho^{(t)}_{XV} can be expressed as

ρXV(t+1)=𝐄aA[x𝒳|xx|Φt,a,M(a,x)(ρV|x(t))].\rho^{(t+1)}_{XV}=\mathop{\mathbf{E}\mbox{}}\limits_{a\sim A}\left[\sum_{x\in\mathcal{X}}|x\rangle\langle x|\otimes\Phi_{t,a,M(a,x)}\big{(}\rho^{(t)}_{V|x}\big{)}\right].
Proof of Theorem 3.

We are going to bound the increment of ξt\xi_{t}, which is the shorthand for ξρ(t)(X;V)\xi^{\rho^{(t)}}(X;V). For now let us focus on some stage tt, and let τ\tau be the density operator that minimizes ξt=ρXV(t)ρX(t)τTr\xi_{t}=\big{\|}\rho^{(t)}_{XV}-\rho^{(t)}_{X}\otimes\tau\big{\|}_{\mathrm{Tr}}. Notice that ρX(t)=ρX=2n𝕀X\rho^{(t)}_{X}=\rho_{X}=2^{-n}\mathbb{I}_{X} for every tt.

Since \tau is a fixed quantum state, we can prepare \tau and apply \Phi_{t,A,M(A,X)} to it to obtain a new quantum register V^{\prime}, which depends only on A and M(A,X). Notice that

\rho_{XV^{\prime}}=\mathop{\mathbf{E}}\limits_{a\sim A}\left[\sum_{x\in\mathcal{X}}2^{-n}|x\rangle\langle x|\otimes\Phi_{t,a,M(a,x)}(\tau)\right],

and therefore similarly to Lemma 4.14 (that evolution does not increase trace distance), we can show that

\displaystyle\left\|\rho^{(t+1)}_{XV}-\rho_{XV^{\prime}}\right\|_{\mathrm{Tr}}\leq\mathop{\mathbf{E}}\limits_{a\sim A}\sum_{x\in\mathcal{X}}\left\|\Phi_{t,a,M(a,x)}\big(\rho^{(t)}_{V|x}\big)-\Phi_{t,a,M(a,x)}\big(2^{-n}\tau\big)\right\|_{\mathrm{Tr}}
\displaystyle\leq\sum_{x\in\mathcal{X}}\big\|\rho^{(t)}_{V|x}-2^{-n}\tau\big\|_{\mathrm{Tr}}=\left\|\rho^{(t)}_{XV}-\rho_{X}\otimes\tau\right\|_{\mathrm{Tr}}=\xi_{t}.

Hence we have

ξt+1\displaystyle\xi_{t+1} ρXV(t+1)ρXρVTr\displaystyle\leq\left\|\rho^{(t+1)}_{XV}-\rho_{X}\otimes\rho_{V^{\prime}}\right\|_{\mathrm{Tr}}
ρXV(t+1)ρXVTr+ρXVρXρVTr\displaystyle\leq\left\|\rho^{(t+1)}_{XV}-\rho_{XV^{\prime}}\right\|_{\mathrm{Tr}}+\big{\|}\rho_{XV^{\prime}}-\rho_{X}\otimes\rho_{V^{\prime}}\big{\|}_{\mathrm{Tr}}
ξt+2ξρ(X;V)\displaystyle\leq\xi_{t}+2\xi^{\rho}(X;V^{\prime})
ξt+8n+q2r/4.\displaystyle\leq\xi_{t}+8\sqrt{n+q}\cdot 2^{-r/4}.

Since ξ0=0\xi_{0}=0, we conclude that

ξT8Tn+q2r/4.\xi_{T}\leq 8T\sqrt{n+q}\cdot 2^{-r/4}.

This value bounds the difference between the success probability of \rho and that of a quantum branching program whose memory is independent of X. The latter is clearly at most 2^{-n}, as such a program outputs a guess that is independent of the uniformly random X; this finishes the proof. ∎