
Memory-Sample Lower Bounds for Learning with Classical-Quantum Hybrid Memory

Qipeng Liu Simons Institute for Theory of Computing. E-mail: [email protected]. Research supported in part by the Simons Institute for the Theory of Computing, through a Quantum Postdoctoral Fellowship, by the DARPA SIEVE-VESPA grant No.HR00112020023 and by the NSF QLCI program through grant number OMA-2016245.    Ran Raz Princeton University. E-mail: [email protected]. Research supported by a Simons Investigator Award and by the National Science Foundation grants No. CCF-1714779, CCF-2007462.    Wei Zhan Princeton University. E-mail: [email protected]. Research supported by a Simons Investigator Award and by the National Science Foundation grants No. CCF-1714779, CCF-2007462.
Abstract

In a work by Raz (J. ACM and FOCS 16), it was proved that any algorithm for parity learning on $n$ bits requires either $\Omega(n^{2})$ bits of classical memory or an exponential number (in $n$) of random samples. A line of recent works continued that research direction and showed that for a large collection of classical learning tasks, either super-linear classical memory size or super-polynomially many samples are needed. All these works consider learning algorithms as classical branching programs, which perform classical computation within bounded memory.

However, these results do not capture all physical computational models, most notably quantum computers and the use of quantum memory. They leave open the possibility that a small piece of quantum memory could significantly reduce the need for classical memory or samples and thus completely change the nature of the classical learning task. Despite recent research on the necessity of quantum memory for intrinsically quantum learning problems like shadow tomography and purity testing, the role of quantum memory in classical learning tasks remains obscure.

In this work, we study classical learning tasks in the presence of quantum memory. We prove that any quantum algorithm with both classical and quantum memory for parity learning on $n$ bits requires either $\Omega(n^{2})$ bits of classical memory or $\Omega(n)$ bits of quantum memory or an exponential number of samples. In other words, the memory-sample lower bound for parity learning remains qualitatively the same, even if the learning algorithm can use, in addition to the classical memory, a quantum memory of size $cn$ (for some constant $c>0$).

Our result is more general and applies to many other classical learning tasks. Following previous works, we represent by the matrix $M:A\times X\to\{-1,1\}$ the following learning task. An unknown $x$ is sampled uniformly at random from a concept class $X$, and a learning algorithm tries to uncover $x$ by seeing a stream of random samples $(a_{i},b_{i}=M(a_{i},x))$ where for every $i$, $a_{i}\in A$ is chosen uniformly at random. Assume that $k,\ell,r$ are integers such that any submatrix of $M$ of at least $2^{-k}\cdot|A|$ rows and at least $2^{-\ell}\cdot|X|$ columns has a bias of at most $2^{-r}$. We prove that any algorithm with classical-quantum hybrid memory for the learning problem corresponding to $M$ needs either (1) $\Omega(k\cdot\ell)$ bits of classical memory, or (2) $\Omega(r)$ qubits of quantum memory, or (3) $2^{\Omega(r)}$ random samples, to achieve a success probability of at least $2^{-O(r)}$.

Our results refute the possibility that a small amount of quantum memory significantly reduces the size of classical memory needed for efficient learning on these problems. Our results also imply improved security of several existing cryptographical protocols in the bounded-storage model (protocols that are based on parity learning on $n$ bits), proving that security holds even in the presence of a quantum adversary with at most $cn^{2}$ bits of classical memory and $cn$ bits of quantum memory (for some constant $c>0$).

1 Introduction

Memory plays an important role in learning. Starting from the seminal works by Shamir [Sha14] and Steinhardt, Valiant and Wager [SVW16], a sequence of works initiated and deepened the study of lower bounds for learning under memory constraints. Steinhardt, Valiant, and Wager [SVW16] conjectured that in order to learn an unknown $n$-bit string from samples of random-subset parity, an algorithm needs either memory size quadratic in $n$ or exponentially many random samples (also in $n$). This conjecture was later proved by Raz [Raz18], showing for the first time that for some learning problems, super-linear memory size is required for efficient learning. This result was then generalized to a broad class of learning problems [KRT17, Raz17, MM18, BGY18, GRT18, SSV19, GKR20, GKLR21].

Although we have a comprehensive understanding of the (in)feasibility of learning under limitations on a particular computational resource (memory), the previous works mentioned above do not capture all physical computational models; most notably, quantum computation and the power of quantum memory. Many researchers believe that large-scale quantum computers will eventually become viable. Recent experiments demonstrated quantum advantages, for example [AAB+19], and suggested that there are possibly no fundamental barriers to achieving quantum memory and quantum computers. Questions on the role of quantum memory in learning were proposed by Wright in the context of general state tomography [Wri16] and by Aaronson for shadow tomography [Aar18]. A line of works [HHJ+16, BCL20, HKP20, CCHL22, ACQ22, CLO22] pioneered this direction and showed either polynomial or exponential separations for learning with/without quantum memory, but all for intrinsically quantum learning tasks like state tomography, shadow tomography and purity testing.

In light of the above, it is appealing to consider classical learning tasks in the presence of quantum memory, as well as hybrid classical-quantum memory. A direct implication of all aforementioned classical results only gives trivial results. As $k$ qubits of memory can always be efficiently simulated by $\sim 2^{k}$ classical bits, we can only conclude (say, for parity learning) that either $\sim 2\log n$ qubits of quantum memory or exponentially many samples are needed. Prior to our work, it could have been the case that even if only a very small quantum memory was available, it might have significantly reduced the need for classical memory and led to an efficient learning algorithm.

In this work, we prove memory-sample lower bounds in the presence of hybrid memory for a wide collection of classical learning problems. As in [Raz17, GRT18], we will represent a learning problem by a matrix $M:A\times X\to\{-1,1\}$ whose columns correspond to concepts in the concept class $X$ and rows correspond to random samples. In the learning task, an unknown concept $x\in X$ is sampled uniformly at random and each random sample is given as $(a_{i},b_{i})=(a_{i},M(a_{i},x))$ for a uniformly picked $a_{i}\in A$. The learner's goal is to uncover $x$. In [GRT18], it is proved that when the underlying matrix $M$ is a $(k,\ell)$-$L_{2}$ two source extractor (roughly speaking, this means that every submatrix $M^{\prime}$ of $M$ with at least $2^{-k}\cdot|A|$ rows and at least $2^{-\ell}\cdot|X|$ columns has a relative bias of at most $2^{-r}$) with error $2^{-r}$, a learning algorithm requires either $\Omega(k\cdot\ell)$ bits of memory or $2^{\Omega(r)}$ samples to achieve a success probability of at least $2^{-O(r)}$ for the learning task.
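To make the matrix view concrete, here is a minimal Python sketch (illustrative only; the parameters and helper names are ours, not the paper's) of parity learning in this framework: the matrix is $M(a,x)=(-1)^{\langle a,x\rangle}$, a sample is $(a,M(a,x))$ for a uniformly random $a$, and the extractor condition says that every large enough submatrix has small bias.

```python
import numpy as np

# A minimal sketch (not from the paper) illustrating the matrix view of parity
# learning: M(a, x) = (-1)^{<a, x>}, samples are (a, M(a, x)) for random a.
rng = np.random.default_rng(0)
n = 8
x = rng.integers(0, 2, n)              # unknown concept, uniform in {0,1}^n

def sample():
    a = rng.integers(0, 2, n)          # uniformly random row index a
    b = (-1) ** int(np.dot(a, x) % 2)  # label b = M(a, x) in {-1, 1}
    return a, b

# Bias of a submatrix: rows A' x columns X', estimated by averaging M(a, x').
def bias(rows, cols):
    vals = [(-1) ** int(np.dot(a, xp) % 2) for a in rows for xp in cols]
    return abs(np.mean(vals))

rows = [rng.integers(0, 2, n) for _ in range(64)]
cols = [rng.integers(0, 2, n) for _ in range(64)]
print(sample(), bias(rows, cols))      # random submatrices typically have tiny bias
```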

1.1 Our Results

In this work, we model a quantum learning algorithm as a program with hybrid memory consisting of $q$ qubits of quantum memory and $m$ bits of classical memory. At each stage, a random sample $(a_{i},b_{i}=M(a_{i},x))$ is given to the algorithm. The quantum learning algorithm applies an arbitrary quantum channel to the hybrid memory, controlled by the random sample. Although the channel can be arbitrary, we require the outcome to be a hybrid classical-quantum state of at most $q$ qubits and $m$ bits. We stress that there is no limitation on the complexity of the quantum channel (this only makes our results stronger, as we are proving lower bounds for such algorithms).

With the above model, we give the following main theorem.

Theorem 1 (Main Theorem, Informal).

Let $M:A\times X\to\{-1,1\}$ be a matrix. If $M$ is a $(k,\ell)$-$L_{2}$ two source extractor with error $2^{-r}$, a quantum learning algorithm requires either

  1. $\Omega(k\cdot\ell)$ bits of classical memory; or,

  2. $\Omega(r)$ qubits of quantum memory; or,

  3. $2^{\Omega(r)}$ samples,

to succeed with a probability of at least $2^{-O(r)}$ in the corresponding learning task.

Our main theorem implies that for many learning problems, the availability of a quantum memory of size up to $\Omega(r)$ does not reduce the size of classical memory or the number of samples that are needed. As coherent quantum memory is challenging for near-term intermediate-scale quantum computers and is likely to remain expensive even if and when quantum computers become widely viable, the impact of quantum memory is further limited for these learning problems.

To make the theorem more precise, let us take parity learning as an example. The above theorem says that a quantum learning algorithm needs either $\Omega(n^{2})$ bits of classical memory or $\Omega(n)$ qubits of quantum memory to conduct efficient learning; otherwise, it requires $2^{\Omega(n)}$ random samples. At first glance, it seems that the constraint on quantum memory is trivial: if the target is to learn an $n$-bit unknown secret, a linear amount of memory always seems necessary to store the secret. However, our main theorem applies to quantum learning algorithms with hybrid memory and rules out algorithms with $n^{2}/1000$ bits and $n/1000$ qubits of hybrid memory for parity learning, so it yields non-trivial and compelling memory-sample lower bounds. Note also that our results (and previous results) are valid even if the goal is to output only one bit of the secret. Currently, we do not know whether our main theorem is tight. For parity learning, we are not aware of any quantum learning algorithm that uses only $O(n)$ qubits of quantum memory. We leave closing the gap as a fascinating open question.

The main theorem naturally applies to other learning problems considered in [GRT18], including learning sparse parities, learning from sparse linear equations, and many others. We do not present an exhaustive list here but refer the readers to [GRT18] for more details.

Along the way, we propose a new approach for proving the classical memory-sample lower bounds, which we call the "badness levels" method. The approach is technically equivalent to the previous approach in [Raz17, GRT18] but is conceptually simpler to work with, and we are able to lift it to the quantum case.

We note that proving a linear lower bound on the size of the quantum memory, without classical memory, is significantly simpler (but to the best of our knowledge such a proof has not appeared prior to our work). We state and prove such a result as Theorem 3 in Appendix C. While Theorem 3 is qualitatively weaker than our main result in most cases, as it only gives a lower bound for programs with quantum memory but without a (possibly quadratic) classical memory, it is technically incomparable and is stated in terms of quantum extractors rather than classical extractors. Additionally, the proof of Theorem 3 is significantly simpler than the proof of our main theorem.

Implications to Cryptography in the Bounded-Storage Model.

Since learning theory and cryptography can be viewed as two sides of the same coin, our theorem also lifts the security of many existing cryptographical protocols in the bounded-storage model (protocols that are based on parity learning) to the quantum setting. To the best of our knowledge, these are the first proofs that classical cryptographical protocols are secure against space-bounded quantum attackers. (On the other hand, there are known examples of classically-secure bounded-storage protocols that are breakable with an exponentially smaller amount of quantum memory [GKK+08].) We elaborate more below.

Cryptography in the (classical) bounded storage model was first proposed by Maurer [Mau92]. In such a model, no computational assumption is needed. Honest execution is performed through a long streaming sequence of bits. Eavesdroppers have bounded storage and only a limited capability of storing the conversation, and thus cannot break the protocol. A line of works [CM97, AR99, ADR02, DR02, DM02, Lu02, DM04, MSTS04, HN06, DQW21, DQW22, …] builds efficient and secure protocols for key agreement, oblivious transfer, bit commitment and time stamping in that model.

Based on the memory-sample lower bounds for parity learning of $n$ bits, [Raz18] suggested an encryption scheme in the bounded-storage model. Guan and Zhandry [GZ19] proposed key agreement, oblivious transfer and bit commitment with improved rounds and better correctness, against attackers with up to $O(n^{2})$ bits of memory. Following a similar idea, Liu and Vusirikala [LV21] showed that semi-honest multiparty computation could be achieved against attackers with up to $O(n^{2})$ bits of memory. More recently, Dodis, Quach, and Wichs [DQW22] considered message authentication in the bounded storage model based on parity learning. Our result on parity learning gives a direct lift to all the results above. When the cryptographic protocols are based on parity learning of $n$ bits (often treated as a security parameter), our result shows that security holds even in the presence of a quantum adversary with at most $O(n^{2})$ bits of classical memory and $O(n)$ qubits of quantum memory.

While there are many previous works on cryptography in the quantum bounded storage model [DFR+07, DFSS07, Sch07, DFSS08, WW08, PMLA13, BY21], they all rely on streaming quantum states. Our memory-sample lower bounds give, for the first time, a rich class of classical cryptographical schemes (key agreement, oblivious transfer, and bit commitment) secure against space-bounded quantum attackers.

2 Proof Overview

2.1 Recap of Proofs for Classical Lower Bounds

Since our proof builds on the previous line of works on classical memory-sample lower bounds for learning, specifically on the proof technique of [Raz17, GRT18], we provide a brief review of these proofs, using parity learning [Raz18] as an example. Below, $M(a,x)$ denotes the inner product of $a$ and $x$ over $\mathbb{F}_{2}$.

Consider a classical branching program that tries to learn an unknown and uniformly random $x\in\{0,1\}^{n}$ from samples $(a,b)$, where $a\in\{0,1\}^{n}$ is uniformly random and $b=M(a,x)$. We can associate every state $v$ of the branching program with a distribution $P_{X|v}$ over $\{0,1\}^{n}$, indicating the distribution of $x$ conditioned on reaching that state. At the initial state, without any information about $x$, the distribution is uniform (which has the smallest possible $\ell_{2}$-norm). Along a computational path on the branching program, the distribution $P_{X|v}$ evolves and eventually gets concentrated (with large $\ell_{2}$-norm) in order to output $x$ correctly. Therefore, during the evolution, $P_{X|v}$ should at some stage have a mildly large $\ell_{2}$-norm ($2^{\varepsilon n}$ times larger than uniform for some small constant $\varepsilon>0$). If we set such a distribution as a target, the distribution is hard to achieve with random samples. Only with probability $2^{-\Omega(n)}$ can the branching program make significant progress towards the target; most of the time a sample just splits the distributions (both the current and the target distribution) into two even parts, and that does not help much in getting closer to the target distribution (with large $\ell_{2}$-norm).

To put it more rigorously, we examine the evolution of the inner product

\langle P_{X|v},P\rangle=\sum_{x\in\{0,1\}^{n}}P_{X|v}(x)\cdot P(x)

between the distribution $P_{X|v}$ on the current state $v$, and a target distribution $P$. Receiving a sample $(a,b)$ implies that $M(a,x)=b$, hence only the part of $P_{X|v}$ supported on such $x$ proceeds. If this part has probability close to $\frac{1}{2}$, we say that $a$ divides $P_{X|v}$ evenly. Denoting the new distribution as $P_{X|v}^{(a,b)}$, after proper normalization the new inner product is

\langle P_{X|v}^{(a,b)},P\rangle=\sum_{\begin{subarray}{c}x\in\{0,1\}^{n}\\ M(a,x)=b\end{subarray}}P_{X|v}(x)\cdot P(x)\bigg{/}\sum_{\begin{subarray}{c}x\in\{0,1\}^{n}\\ M(a,x)=b\end{subarray}}P_{X|v}(x). (1)

Ideally, both $P_{X|v}$ and the point-wise product vector $P_{X|v}\cdot P$ should have reasonably small $\ell_{2}$-norms. Due to the extractor property of $M$, most $a\in\{0,1\}^{n}$ should divide both vectors evenly, and thus the denominator is close to $\frac{1}{2}$ while the numerator is close to $\frac{1}{2}\langle P_{X|v},P\rangle$. That means, given a uniformly random $a$, we get limited progress on the inner product. On the other hand, to get from $\langle U,P\rangle=2^{-n}$ with the uniform distribution $U$ to $\langle P,P\rangle=2^{2\varepsilon n}\cdot 2^{-n}$, the branching program needs to make multiple steps of progress. Therefore this happens with an extremely small probability.
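The following hedged numerical sketch (small $n$, illustrative parameters of our choosing) makes the update in Eq. 1 concrete: it conditions the current distribution on a sample $(a,b)$ and renormalizes, and one can observe that a random $a$ typically splits the mass nearly in half, so the inner product barely moves.

```python
import numpy as np

# Sketch of the inner-product update in Eq. (1) for parity learning, small n:
# condition P_{X|v} on a sample (a, b), renormalize, and compare <P^{(a,b)}, P>
# with <P_{X|v}, P>.
rng = np.random.default_rng(1)
n = 10
X = np.array([[(i >> j) & 1 for j in range(n)] for i in range(2 ** n)])

P_xv = rng.random(2 ** n); P_xv /= P_xv.sum()   # current distribution on x
P = rng.random(2 ** n); P /= P.sum()            # target distribution

a = rng.integers(0, 2, n)
x_true = X[rng.integers(2 ** n)]
b = int(np.dot(a, x_true) % 2)                  # sample (a, b) with b = <a, x>

mask = (X @ a) % 2 == b                         # the x consistent with (a, b)
numer = np.sum(P_xv[mask] * P[mask])
denom = np.sum(P_xv[mask])
print("old:", np.dot(P_xv, P), "new:", numer / denom, "split:", denom)
# Typically denom is close to 1/2 and the inner product barely moves.
```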

To ensure that the above argument goes through, we require the following properties for every state $v$ in the branching program:

  • The $\ell_{2}$-norm $\big{\|}P_{X|v}\big{\|}_{2}$ is small.

  • The $\ell_{2}$-norm $\big{\|}P_{X|v}\cdot P\big{\|}_{2}$ is small, which is implied when the $\ell_{\infty}$-norm $\big{\|}P_{X|v}\big{\|}_{\infty}$ is small.

  • The denominator in Eq. 1 is bounded away from 0 for every sample $(a,b)$.

These properties do not hold by themselves. Instead, we execute a truncation procedure on the branching program before choosing a target distribution. More specifically, the branching program is modified so that it stops whenever it:

  • ($\ell_{2}$ truncation): Reaches a state $v$ with large $\big{\|}P_{X|v}\big{\|}_{2}$;

  • ($\ell_{\infty}$ truncation): Reaches a state $v$ with large $P_{X|v}(x)$ when the unknown concept is $x$;

  • (Sample truncation): Or, for the next sample $(a,b)$, $a$ does not divide $P_{X|v}$ evenly.

It turns out that after the $\ell_{2}$ truncation, the other two truncation steps add $2^{-\Omega(n)}$ error in each stage of the branching program. Therefore the proof boils down to proving a $2^{-\Omega(n^{2})}$ bound on the probability of reaching a state with large $\big{\|}P_{X|v}\big{\|}_{2}$, from which, by a standard union bound, we can prove the memory-sample lower bounds for parity learning: either $2^{\Omega(n)}$ samples or $\Omega(n^{2})$ bits of memory are necessary.

2.2 Badness Levels

As mentioned above, to bound the probability of reaching a state with a large $\ell_{2}$-norm, the basic idea is to fix its distribution as the target distribution $P$, and bound the increase of the inner product $\langle P_{X|v},P\rangle$. This was done in [Raz17, GRT18] by designing a potential function that tracks the average of $\langle P_{X|v},P\rangle^{k}$ for some $k=\Theta(n)$, where the average is over states $v$ in the same stage of the branching program. Here we propose another approach using the concept of badness levels. Although it is technically equivalent to the potential function approach in the classical case, it is more pliable and easier to adapt to the quantum case. We view this approach as a separate contribution of our work.

We first define a bad event to be a pair $(v,a)$ of the state $v$ and the upcoming part of the sample $a$, such that $\langle P_{X|v},P\rangle\geq 2^{-n}$, and for one of the two possible outcomes $b$,

\sum_{\begin{subarray}{c}x\in\{0,1\}^{n}\\ M(a,x)=b\end{subarray}}P_{X|v}(x)\cdot P(x)\geq\left(\frac{1}{2}+2^{-\delta n}\right)\cdot\langle P_{X|v},P\rangle (2)

with some small constant $\delta$. In other words, the inner product $\langle P_{X|v},P\rangle$ is large enough, while not being divided evenly by $a$. From Eq. 1 we know that the inner product gets at most roughly doubled through a bad event. In contrast, in the good case, the inner product either gains a mere $(1+2^{-\delta n})$ multiplicative factor or is already smaller than the baseline $2^{-n}$. Also, the extractor property of $M$ ensures that for every state $v$, over a uniformly random $a$, the bad event happens with probability at most $2^{-\Omega(n)}$.

Now, the badness level $\beta(v)$ of a state $v$ keeps track of how many times the computational path went through bad events before reaching $v$. (For now we think of $\beta(v)$ as a natural number. In the actual proof, $\beta(v)$ is a distribution on natural numbers, as for different computational paths reaching the same state, the count of bad events can be different.) The above observations on the bad events imply that (omitting the smaller factors):

  • For every state $v$, $\langle P_{X|v},P\rangle$ is bounded by $2^{\beta(v)}\cdot 2^{-n}$;

  • Heading to the next stage, $\beta(v)$ increases by $1$ with probability $2^{-\Omega(n)}$.

Therefore at each stage, the total weight of states with badness level $\beta$ is at most $2^{-\Omega(\beta n)}$. Thus any state with $\langle P_{X|v},P\rangle\geq 2^{2\varepsilon n}\cdot 2^{-n}$ must have probability $2^{-\Omega(n^{2})}$.
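As a sanity check of this counting (not part of the proof, with illustrative parameters of our choosing), one can simulate the badness level as a counter that increments with probability $p=2^{-cn}$ at each of $T$ stages; the mass at level $\beta$ then decays roughly like $(Tp)^{\beta}$, which is $2^{-\Omega(\beta n)}$ when $T\ll 2^{cn}$.

```python
import numpy as np

# Illustrative sketch: badness level as a counter that increments with
# probability p at each stage; the mass at level beta decays like (T*p)^beta.
rng = np.random.default_rng(2)
n, c, T, trials = 20, 0.5, 200, 20000
p = 2.0 ** (-c * n)

beta = np.sum(rng.random((trials, T)) < p, axis=1)   # final badness levels
for level in range(3):
    print(level, np.mean(beta >= level), "vs bound ~", (T * p) ** level)
# Reaching level beta ~ n would require probability roughly 2^{-Omega(n^2)}.
```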

2.3 Obstacles for Proving Quantum Lower Bounds

In this section, we present an attempt to prove the same $2^{\Omega(n)}$-sample or $\Omega(n^{2})$-quantum-memory lower bound for the purely quantum case. Along the way we identify some obstacles to proving memory-sample lower bounds for quantum learning algorithms, and in the next section we show how to overcome these obstacles while proving lower bounds for hybrid learning algorithms, with quadratic-size classical memory and linear-size quantum memory.

Following the same framework as the above described proof for the classical case, we first need to transfer all the notions to quantum algorithms:

  • The state $v$ is a quantum state in the Hilbert space of quantum memory;

  • The distribution $P_{X|v}$ is still well-defined: it is the distribution of $x$ when the quantum memory is measured to $v$ (see Section 3.4 and Eq. 3);

  • We are still able to implement $\ell_{2}$ truncation: if $P_{X|v}$ has large $\ell_{2}$-norm, project the entire system to the orthogonal subspace $v^{\perp}$ of $v$ and repeat, until there is no such state $v$ (see Section 4.1 for details).

  • We are also able to implement sample truncation, in a similar manner to $\ell_{2}$ truncation. As the criterion here depends on $a$, we separately create a copy of the current system for each $a$, truncate the states $v$ using projection when $P_{X|v}$ is not evenly divided by $a$ in each copy, and then merge them back together. We prove that the error introduced by this truncation is small.

Here comes the first major obstacle: $\ell_{\infty}$ truncation. In the classical case, $\ell_{\infty}$ truncation is implemented for each individual $x$, in contrast to $\ell_{2}$ truncation where the states are removed altogether. Relying on the fact that the $\ell_{2}$-norm of the distribution is already known to be small, one can use the Markov inequality to prove that the error introduced by the $\ell_{\infty}$ truncation is small.

However, when we try to emulate the classical implementation of $\ell_{\infty}$ truncation with quantum truncation, that is, to project the system to $v^{\perp}$ only conditioned on the specific $x$ where $P_{X|v}(x)$ is large, instead of for every $x$, it may lead to huge changes to the distributions $P_{X|u}$ on states $u$ non-orthogonal to $v$. The following example illustrates such a scenario:

Example.

Consider a quantum learning algorithm, and assume that at some stage of the computation, for each $x\in\{0,1\}^{n}$, the quantum memory is in some pure state $v(x)$. We pick each $v(x)$ uniformly at random in a Hilbert space of dimension $d\approx 2^{n/2}$ and consider a typical configuration of $v(x)$. Now the $\ell_{2}$-norms are bounded for every quantum state $v$: the worst ones happen when $v=v(x)$ for some $x$, where $\|P_{X|v(x)}\|_{2}$ is typically around $d\cdot 2^{-n}$, close to the $\ell_{2}$-norm of the uniform distribution. However, those worst distributions also have $\ell_{\infty}$-norms close to $d\cdot 2^{-n}$, which is much larger than the $\ell_{\infty}$-norm of the uniform distribution, and needs to be truncated. But truncating $v(x)$ off for $x$ means that $x$ is completely erased, and we end up removing everything.
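A hedged numerical sketch of this example (with a small $n$ for tractability, and Haar-random states generated as normalized Gaussian vectors) shows both effects: the conditional distribution $P_{X|v(x_{0})}$ has $\ell_{2}$-norm around $d\cdot 2^{-n}$, yet its $\ell_{\infty}$-norm is also around $d\cdot 2^{-n}$ and the mass sits on $x_{0}$ itself.

```python
import numpy as np

# Each concept x gets a random pure state v(x) in dimension d ~ 2^{n/2}.
# Conditioned on measuring the memory in state v(x0), the distribution on x is
# P(x) proportional to |<v(x0), v(x)>|^2, which is spiky at x = x0.
rng = np.random.default_rng(3)
n, d = 12, 2 ** 6                     # d ~ 2^{n/2}
N = 2 ** n

V = rng.normal(size=(N, d)) + 1j * rng.normal(size=(N, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # rows are the states v(x)

x0 = 0
overlaps = np.abs(V @ V[x0].conj()) ** 2        # |<v(x0), v(x)>|^2 for all x
P = overlaps / overlaps.sum()

print("l2 norm:", np.linalg.norm(P), "vs d*2^-n =", d / N)
print("linf norm:", P.max(), "vs uniform 2^-n =", 1 / N)
# Both are around d*2^{-n}: the l2 norm is fine, but the linf norm is d times
# larger than uniform, and the mass sits exactly on x0.
```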

Moving on, we fix a target state $v$ with a target distribution $P_{X|v}$ that exceeds the $\ell_{2}$-norm threshold, and the goal is again to prove a $2^{-\Omega(n^{2})}$ amplitude bound on $v$. The bad event should still be defined as a pair $(v,a)$ satisfying Eq. 2, with $v$ now being a quantum state. We then run into the second major obstacle: it is not clear how to define badness levels.

If we define the badness level $\beta(v)$ for each state $v$ individually by examining the bad events over the historical states, then it is not clear how to measure the total weight of a badness level $\beta$. In the classical case, we simply define the total weight as the total probability of states with badness level $\beta$. But here in the quantum case, it turns out that such a definition either depends on the choice of basis, in which case it might have a large increment in each stage, or completely fails to imply the desired amplitude bound on the target state.

The other choice is to have a more operational definition of badness levels, and it is indeed tempting to define $\beta$ as another register whose updates are controlled by the quantum memory. The problem with such definitions is that the bad event (Eq. 2) is not linear in $v$. Therefore an operational definition of badness level, which is a linear operator, inevitably introduces error that escalates quickly with the number of stages.

2.4 Hybrid Memory Lower Bounds with Small Quantum Memory

The obstacles in the previous section arise when proving a quadratic quantum-memory lower bound. We note that proving a linear quantum-memory lower bound (without classical memory) is not hard: the proof can be entirely information-theoretic, as with very limited memory, say $\frac{1}{2}n$ qubits, the information gained from each sample is exponentially small, despite the memory being quantum. We present such a proof in Appendix C.

The lower bounds that we prove here are for hybrid memory: to learn parity with both classical and quantum memory, an algorithm needs either $2^{\Omega(n)}$ samples, or $\Omega(n^{2})$ bits of classical memory, or $\Omega(n)$ qubits of quantum memory (Theorem 1). We now describe how we overcome the previously mentioned obstacles.

\ell_{\infty} Truncation.

When there is only a small quantum memory and no classical memory, the treatment of $\ell_{\infty}$ truncation is straightforward. We remove all quantum states $v$ with distributions of large $\ell_{\infty}$-norm, by projecting the system to the orthogonal subspace $v^{\perp}$, just like in the $\ell_{2}$ truncation. As the overall distribution on $x$ is uniform, any state $v$ with $\|P_{X|v}\|_{\infty}\geq 2^{\delta n}\cdot 2^{-n}$ must have weight at most $2^{-\delta n}$. Therefore, as long as the dimension of the Hilbert space is much smaller than $\delta n$, the error introduced in this truncation is small. (The example in the previous section, which shows the infeasibility of treating $\ell_{\infty}$ truncation the same way as $\ell_{2}$ truncation, does not apply here, as it requires $n/2$ qubits of memory while here we have a smaller memory size.)

With classical memory present, the actual $\ell_{\infty}$ truncation step (see Section 4.2, Step 2) is more complicated. We first apply the original classical $\ell_{\infty}$ truncation on the classical memory $W$. Now that $\|P_{X|w}\|_{\infty}$ is bounded for each classical memory state $w$, we can remove the quantum states $v$ with large $\|P_{X|v,w}\|_{\infty}$ by projection as stated above. Since the classical $\ell_{\infty}$ truncation depends on $x$, it could change the distributions $P_{X|v,w}$. However, as in the classical case, $P_{X|w}$ will not change a lot. Thus, wherever $P_{X|v,w}$ changes drastically, it must have small weight and can also be removed by projection. This removal corresponds to truncation by $G_{t}$ in Section 4.2.

Badness Levels.

Interestingly, we are able to avoid the problems of defining the badness level on quantum memory altogether, by keeping it a property of the classical memory only. To do so we need to alter the definition of a bad event: it is now a pair $(w,a)$ of a classical memory state $w$ and a sample $a$, such that there exists some quantum memory state $v$ with $P_{X|v,w}$ satisfying Eq. 2.

For each fixed classical memory state $w$, we still need to ensure that bad events happen with small probability. We prove this (Lemma 5.2) by showing that, if there are many different samples $a$, each associated with some quantum state $v_{a}$ satisfying Eq. 2, then there is some quantum state $v$ that simultaneously satisfies Eq. 2 with most of such $a$ (which is impossible because of the extractor property). This is ultimately due to the continuous nature of Eq. 2: under a proper congruent transformation, Eq. 2 becomes a simple threshold inequality on quadratic forms over $v$. Now if it is satisfied by some $v_{a}$, it is going to be satisfied by most $v$ for a much smaller threshold parameter $\delta$, and hence the existence of a simultaneously satisfying $v$. (We note that the error bound for sample truncation (Lemma 4.12) is also proved using this argument.) In this argument, we use Lemma 3.1, which is derived from the anti-concentration bound for Gaussian quadratic forms, and crucially relies on the fact that the dimension is at most $2^{\varepsilon n}$ for some small $\varepsilon$.

Another technical problem is that in order to use the extractor property, we need to ensure that $\langle P_{X|v,w},P\rangle\geq 2^{-n}$ for the simultaneously satisfying $v$. Thus, what we do in Lemma 5.2 is to first conceptually remove the parts where $\langle P_{X|v,w},P\rangle$ is too small, using projections similar to the truncation steps. After the removal, we are left with a subspace $\mathcal{V}^{\prime}$ where $\langle P_{X|v,w},P\rangle$ is always lower bounded, and we show that for every state $v$ that satisfies Eq. 2, the inequality is still close to being satisfied after projecting $v$ onto $\mathcal{V}^{\prime}$. Therefore we can still apply the above argument and find a simultaneously satisfying $v$ within the subspace.

3 Preliminaries

3.1 Vectors and Matrices

For a vector $v\in\mathbb{C}^{d}$ and $p\in[1,\infty]$, we define the $\ell_{p}$ norm of $v$ as

\|v\|_{p}=\left(\sum_{i=1}^{d}|v_{i}|^{p}\right)^{1/p}.

For two vectors $u,v\in\mathbb{C}^{d}$, define their inner product as $\langle u,v\rangle=u^{\dagger}v=\sum_{i=1}^{d}\overline{u_{i}}v_{i}$, so that $\|v\|_{2}^{2}=\langle v,v\rangle$. We also view every distribution $P$ over a set $\mathcal{X}$ as a non-negative real vector with $\|P\|_{1}=1$.

We specifically use Dirac notation to denote unit vectors: $|v\rangle\in\mathbb{C}^{d}$ implies that $\||v\rangle\|_{2}=1$. For a non-zero vector $u\in\mathbb{C}^{d}$, let $|v\rangle\sim u$ denote the normalization of $u$, that is, $|v\rangle=u/\|u\|_{2}$.

For every vector $v\in\mathbb{C}^{d}$, let $\operatorname{Diag}v\in\mathbb{C}^{d\times d}$ be the diagonal matrix whose diagonal entries represent $v$. Conversely, for every square matrix $M$, let $\operatorname{diag}M$ be the vector consisting of the diagonal entries of $M$. For a matrix (or generally a linear operator) $M$, we use $\|M\|_{\mathrm{Tr}}$ and $\|M\|_{2}$ to denote its trace norm and spectral norm respectively, that is,

\big{\|}M\big{\|}_{\mathrm{Tr}}=\mathrm{Tr}\left[\sqrt{MM^{\dagger}}\right],\quad\big{\|}M\big{\|}_{2}=\max_{v\neq 0}\|Mv\|_{2}/\|v\|_{2}.
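As a quick illustration (not from the paper), the trace norm equals the sum of the singular values of $M$ and the spectral norm equals the largest one; a short numpy check:

```python
import numpy as np

# The trace norm is the sum of singular values (Tr[sqrt(M M^dagger)]) and the
# spectral norm is the largest singular value, matching the definitions above.
rng = np.random.default_rng(4)
M = rng.normal(size=(5, 5)) + 1j * rng.normal(size=(5, 5))

s = np.linalg.svd(M, compute_uv=False)            # singular values of M
trace_norm, spectral_norm = s.sum(), s.max()

# Cross-check Tr[sqrt(M M^dagger)] via the eigenvalues of M M^dagger.
eig = np.linalg.eigvalsh(M @ M.conj().T)
assert np.isclose(trace_norm, np.sqrt(np.clip(eig, 0, None)).sum())
print(trace_norm, spectral_norm)
```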

For a Hermitian $M\in\mathbb{C}^{d\times d}$, we say it is a positive semi-definite operator if for every $v\in\mathbb{C}^{d}$, $v^{\dagger}Mv\geq 0$. A (partial) density operator is a positive semi-definite operator with trace equal to $1$ (or at most $1$, respectively).

Viewing a Learning Problem as a Matrix

Let $M:\mathcal{A}\times\mathcal{X}\rightarrow\{-1,1\}$ be a matrix. The matrix $M$ corresponds to the following learning problem. There is an unknown element $x\in\mathcal{X}$ that was chosen uniformly at random. A learner tries to learn $x$ from samples $(a,b)$, where $a\in\mathcal{A}$ is chosen uniformly at random and $b=M(a,x)$. That is, the learning algorithm is given a stream of samples $(a_{1},b_{1}),(a_{2},b_{2}),\ldots$, where each $a_{t}$ is uniformly distributed and for every $t$, $b_{t}=M(a_{t},x)$. For each $a\in\mathcal{A}$, we use $M_{a}:\mathcal{X}\rightarrow\{-1,1\}$ to denote the vector corresponding to the $a$-th row of $M$.

Extractors

A matrix $M:\mathcal{A}\times\mathcal{X}\rightarrow\{-1,1\}$ with $n=\log_{2}|\mathcal{X}|$ is a $(k,\ell)$-$L_{2}$ extractor with error $2^{-r}$, if for every distribution $P$ over $\mathcal{X}$ with $\|P\|_{2}\leq 2^{\ell}\cdot 2^{-n/2}$, there are at most $2^{-k}\cdot|\mathcal{A}|$ rows $a\in\mathcal{A}$ such that

|\langle M_{a},P\rangle|\geq 2^{-r}.
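The following hedged sketch illustrates the definition on the parity matrix for a small $n$ (the subset construction and thresholds are ours): it takes $P$ uniform on a subset of columns, so that $\|P\|_{2}$ is bounded, and counts the rows $a$ with $|\langle M_{a},P\rangle|\geq 2^{-r}$.

```python
import numpy as np

# Sketch of the L2-extractor condition for the parity matrix M(a,x) = (-1)^{<a,x>}
# on small n: take P uniform on a subset of 2^{n - l} concepts (so that
# ||P||_2 = 2^{l/2} * 2^{-n/2}) and count rows a with large bias.
rng = np.random.default_rng(5)
n, l, r = 10, 3, 3
N = 2 ** n
X = np.array([[(i >> j) & 1 for j in range(n)] for i in range(N)])

subset = rng.choice(N, size=2 ** (n - l), replace=False)
P = np.zeros(N)
P[subset] = 1.0 / len(subset)                     # flat distribution, small l2 norm

bad_rows = 0
for i in range(N):                                # enumerate all rows a
    row = (-1.0) ** ((X @ X[i]) % 2)              # M_a as a +-1 vector over x
    if abs(row @ P) >= 2.0 ** (-r):
        bad_rows += 1
print("fraction of rows with bias >= 2^-r:", bad_rows / N)
```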

3.2 Anti-Concentration Bound for Quadratic Form on Unit Vectors

Lemma 3.1.

There exists an absolute constant $c$ such that the following holds. Let $\sigma$ be a Hermitian operator over the Hilbert space $\mathcal{V}=\mathbb{C}^{d}$, and let $v$ be a uniformly random unit vector in $\mathcal{V}$. Then for every $\varepsilon>0$, we have

\Pr\left[|v^{\dagger}\sigma v|\leq\frac{\varepsilon\|\sigma\|_{2}}{d}\right]\leq c\sqrt{\varepsilon}+e^{-d}.
Proof.

Let $g=(g_{1},\ldots,g_{d})\sim\mathcal{N}(0,1)^{d}$ be standard Gaussians. Notice that $\|g\|_{2}^{2}$ follows the $\chi_{d}^{2}$-distribution, and $|g^{\dagger}\sigma g|/\|g\|_{2}^{2}$ is equidistributed with $|v^{\dagger}\sigma v|$. Therefore by a union bound we have

\Pr_{v}\left[|v^{\dagger}\sigma v|\leq\frac{\varepsilon\|\sigma\|_{2}}{d}\right]=\Pr_{g}\left[|g^{\dagger}\sigma g|\leq{\varepsilon\|\sigma\|_{2}}\cdot\frac{\|g\|^{2}_{2}}{d}\right]\leq\Pr_{g}\left[|g^{\dagger}\sigma g|\leq 5\varepsilon\|\sigma\|_{2}\right]+\Pr_{g}\Big{[}\|g\|_{2}^{2}\geq 5d\Big{]}.

For the first term, notice that $\mathrm{Var}[g^{\dagger}\sigma g]=2\mathrm{Tr}[\sigma^{2}]$ (see e.g. [RS08, Chapter 5]), which is no smaller than $2\|\sigma\|_{2}^{2}$. Therefore, by the Carbery–Wright inequality [CW01], there exists an absolute constant $c$ such that

\Pr\left[|g^{\dagger}\sigma g|\leq 5\varepsilon\|\sigma\|_{2}\right]\leq\Pr\left[|g^{\dagger}\sigma g|\leq 4\varepsilon\mathrm{Var}[g^{\dagger}\sigma g]^{1/2}\right]\leq c\sqrt{\varepsilon}.

For the second term, the standard Laurent-Massart bound on $\chi^{2}$-distributions [LM00] gives:

\Pr\Big{[}\|g\|_{2}^{2}\geq 5d\Big{]}\leq e^{-d}.\qed
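A hedged Monte Carlo check of Lemma 3.1 (illustrative parameters; not a proof) estimates the probability on the left-hand side by sampling random unit vectors:

```python
import numpy as np

# Estimate Pr[|v' sigma v| <= eps * ||sigma||_2 / d] over random unit vectors v.
rng = np.random.default_rng(6)
d, eps, trials = 32, 0.01, 20000

A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
sigma = (A + A.conj().T) / 2                       # a Hermitian operator
spec = np.max(np.abs(np.linalg.eigvalsh(sigma)))   # ||sigma||_2

g = rng.normal(size=(trials, d)) + 1j * rng.normal(size=(trials, d))
v = g / np.linalg.norm(g, axis=1, keepdims=True)   # random unit vectors
quad = np.abs(np.einsum('ti,ij,tj->t', v.conj(), sigma, v))

print(np.mean(quad <= eps * spec / d), "vs c*sqrt(eps) + e^-d ~", np.sqrt(eps))
```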

3.3 Multipartite Quantum Systems

The state of $q$ qubits can be represented in a Hilbert space $\mathcal{V}=(\mathbb{C}^{2})^{\otimes q}=\mathbb{C}^{2^{q}}$. In a product of $m$ Hilbert spaces $\mathcal{V}_{[m]}=\mathcal{V}_{1}\otimes\cdots\otimes\mathcal{V}_{m}$, a multipartite partial system $V_{1},\ldots,V_{m}$ is represented by a partial density operator $\rho_{V_{[m]}}$. For a subset $I\subseteq[m]$ of indices, the subsystem on $\{V_{i}\}_{i\in I}$ (or $V_{I}$ for short) is defined by tracing out $j\notin I$, that is,

\rho_{V_{I}}=\mathrm{Tr}_{V_{j\notin I}}[\rho_{V_{[m]}}].

Now for any two disjoint subsets $I,J\subset[m]$, given some $|v_{J}\rangle\in\mathcal{V}_{J}=\bigotimes_{j\in J}\mathcal{V}_{j}$, the conditional system on $V_{I}$ is defined as

\rho_{V_{I}|v_{J}}=\left(\mathbb{I}_{V_{I}}\otimes\langle v_{J}|\right)\rho_{V_{I\cup J}}\left(\mathbb{I}_{V_{I}}\otimes|v_{J}\rangle\right),

which is a partial density operator on $V_{I}$. Note that the trace

\mathrm{Tr}\left[\rho_{V_{I}|v_{J}}\right]=\langle v_{J}|\rho_{V_{J}}|v_{J}\rangle

only depends on the system $\rho$ and $|v_{J}\rangle$, while being independent of the choice of $I$.

Another simple fact that will be used repeatedly later on is that for an orthogonal basis $\mathcal{B}$ of $V_{J}$, we have

\rho_{V_{I}}=\mathrm{Tr}_{V_{J}}[\rho_{V_{I\cup J}}]=\sum_{|v_{J}\rangle\in\mathcal{B}}\rho_{V_{I}|v_{J}}.
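A small numpy sketch (ours, for illustration) of these two operations on a bipartite system: forming the conditional operator $\rho_{V_{I}|v_{J}}$ by sandwiching with $\mathbb{I}_{V_{I}}\otimes\langle v_{J}|$, and recovering the partial trace by summing over an orthonormal basis of $V_{J}$.

```python
import numpy as np

# Conditional operator rho_{V_I | v_J} and the fact that summing it over an
# orthonormal basis of V_J recovers the partial trace over V_J.
rng = np.random.default_rng(7)
dI, dJ = 3, 4

A = rng.normal(size=(dI * dJ, dI * dJ)) + 1j * rng.normal(size=(dI * dJ, dI * dJ))
rho = A @ A.conj().T
rho /= np.trace(rho)                                # a density operator on V_I (x) V_J

def conditional(rho, vJ):
    # (I_{V_I} (x) <v_J|) rho (I_{V_I} (x) |v_J>)
    proj = np.kron(np.eye(dI), vJ.conj().reshape(1, dJ))
    return proj @ rho @ proj.conj().T

basis = np.eye(dJ)
rho_I = sum(conditional(rho, basis[j]) for j in range(dJ))   # partial trace over V_J

rho_I_direct = np.trace(rho.reshape(dI, dJ, dI, dJ), axis1=1, axis2=3)
print(np.allclose(rho_I, rho_I_direct))             # True
```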

3.4 Classical-Quantum Systems

In the underlying space $\mathcal{V}_{1}\otimes\cdots\otimes\mathcal{V}_{m}$ of the multipartite system, we say $\mathcal{V}_{i}$ is classical if there is a fixed orthogonal basis $\mathcal{B}_{i}$ of $\mathcal{V}_{i}$, such that for every multipartite system $\rho_{V_{[m]}}$, every pair of distinct $|v_{i}\rangle\neq|v_{i}^{\prime}\rangle\in\mathcal{B}_{i}$ and every two states $|v\rangle,|v^{\prime}\rangle\in\bigotimes_{j\neq i}\mathcal{V}_{j}$, we have

\langle v_{i},v|\rho_{V_{[m]}}|v_{i}^{\prime},v^{\prime}\rangle=0.

Without loss of generality, in the rest of the work we always assume $\mathcal{B}_{i}$ is the set of computational basis states. We also identify $\mathcal{V}_{i}$ with the discrete set $\mathcal{B}_{i}$, and remove the Dirac brackets when we talk about the classical elements in $\mathcal{V}_{i}$. In this case every multipartite system $\rho_{V_{[m]}}$ can be written as a direct sum

\rho_{V_{[m]}}=\bigoplus_{v_{i}\in\mathcal{V}_{i}}\rho_{V_{[m]\setminus\{i\}}|v_{i}}.

The reader may find this direct sum viewpoint easier to handle in some later scenarios.

When $V_{I}$ is classical, conditioned on any $|v_{J}\rangle\in\mathcal{V}_{J}$ with $J$ disjoint from $I$, the system $\rho_{V_{I}|v_{J}}$ is represented as a diagonal matrix on $V_{I}$. If $\mathrm{Tr}[\rho_{V_{I}|v_{J}}]>0$, it induces a distribution over the computational basis states of $V_{I}$, defined as

P^{\rho}_{V_{I}|v_{J}}=\operatorname{diag}\rho_{V_{I}|v_{J}}/\mathrm{Tr}[\rho_{V_{I}|v_{J}}]. (3)

In the rest of this paper, whenever we use the notation $P^{\rho}_{V_{I}|v_{J}}$, it is always implicitly assumed that $\mathrm{Tr}[\rho_{V_{I}|v_{J}}]>0$ and the distribution exists.

In this work we typically consider the following scenario: there is a quantum memory register $V$ ranging over the Hilbert space $\mathcal{V}$, and a classical memory register $W$ ranging over the set of memory states $\mathcal{W}$, along with some classical information $X\in\mathcal{X}$ (later in the work, it is the concept to be learned) that is correlated with $V$ and $W$. We will make use of the following fact:

Claim 3.2.

Let $\rho_{XVW}$ be a classical-quantum system over classical $X,W$ and quantum $V$. For every $w\in\mathcal{W}$, $P^{\rho}_{X|w}$ is a convex combination of $P^{\rho}_{X|v,w}$ for some $\{|v\rangle\}\subseteq\mathcal{V}$.

Proof.

Let $\mathcal{B}$ be an orthogonal basis of $\mathcal{V}$, so that we have (from the end of the last section)

\rho_{X|w}=\sum_{|v\rangle\in\mathcal{B}}\rho_{X|v,w}.

Therefore $P^{\rho}_{X|w}$ is a linear combination of $P^{\rho}_{X|v,w}$ for $|v\rangle\in\mathcal{B}$, with non-negative coefficients. Since they are all distributions, it must be a convex combination. ∎

Characterization of operators over classical-quantum hybrid systems.

Now we identify all possible operators on the classical-quantum hybrid memory space $\mathcal{V}\otimes\mathcal{W}$. Before imposing the assumption that $W$ is classical, we think of a quantum channel operating on the system as working on the underlying space $\mathcal{V}\otimes\mathbb{C}^{|\mathcal{W}|}$. We denote by $\mathscr{T}_{\mathcal{V}\otimes\mathcal{W}}$ the set of all such quantum channels $\Phi$ that satisfy the following: for every classical-quantum system $\rho_{VW}$ in $\mathcal{V}\otimes\mathcal{W}$, $W$ is still classical in $\Phi(\rho_{VW})$. That is, for every two states $|v\rangle,|v^{\prime}\rangle\in\mathcal{V}$ and every pair of distinct $w,w^{\prime}\in\mathcal{W}$, we have

\langle v,w|\Phi(\rho_{VW})|v^{\prime},w^{\prime}\rangle=0.

Note that not all channels in $\mathscr{T}_{\mathcal{V}\otimes\mathcal{W}}$ are physically realizable. For instance, with one-bit classical memory and no quantum memory, the channel

\begin{pmatrix}a&c\\ \overline{c}&b\end{pmatrix}\mapsto\begin{pmatrix}a&ic\\ -i\overline{c}&b\end{pmatrix}

is not a classical operator. However, since we are restricted to classical-quantum systems, this channel is effectively equivalent to the identity channel on one-bit classical memory. Generally speaking, every channel in $\mathscr{T}_{\mathcal{V}\otimes\mathcal{W}}$ is equivalent to a channel controlled by $\mathcal{W}$ that maps $\mathcal{V}$ to $\mathcal{V}\otimes\mathcal{W}$. Below, we prove this observation and use it to show the following claim:

Claim 3.3.

Let $\rho_{XVW}$ be a classical-quantum system over classical $X,W$ and quantum $V$. Let $\Phi\in\mathscr{T}_{\mathcal{V}\otimes\mathcal{W}}$, and we use $\Phi(\rho)$ to denote the system after applying $\Phi$ to $VW$ and identity to $X$. Then for every $|v\rangle\in\mathcal{V}$ and $w\in\mathcal{W}$, $P^{\Phi(\rho)}_{X|v,w}$ is a convex combination of $P^{\rho}_{X|v^{\prime},w^{\prime}}$ for some $\{|v^{\prime}\rangle\}\subseteq\mathcal{V}$ and $\{w^{\prime}\}\subseteq\mathcal{W}$.

One difference between Claim 3.2 and Claim 3.3 is that in Claim 3.3 it is not always possible to write $P^{\Phi(\rho)}_{X|v,w}$ as a convex combination of $P^{\rho}_{X|v^{\prime},w^{\prime}}$ for $|v^{\prime}\rangle$ from an orthogonal basis of $\mathcal{V}$, whereas this is always possible in Claim 3.2. Although the difference does not matter in this work, we mention it here for clarity.

Proof.

Since $\Phi\in\mathscr{T}_{\mathcal{V}\otimes\mathcal{W}}$, the following channel is functionally equivalent to $\Phi$ for classical-quantum systems:

\Phi^{\prime}:\rho\to\sum_{w\in\mathcal{W}}\Phi(\rho_{V|w}\otimes|w\rangle\langle w|).

The physical meaning of $\Phi^{\prime}$ is to measure $W$ under the computational basis (which should not change the functionality we care about) and then apply $\Phi$.

By defining the channel $\Phi_{w}(\cdot):=\Phi(\cdot\otimes|w\rangle\langle w|)$, the above can alternatively be written as:

\Phi^{\prime}:\rho\to\sum_{w\in\mathcal{W}}\Phi_{w}(\rho_{V|w}).

Now consider the Kraus representation of each $\Phi_{w}$, that is, a finite set of linear operators $E_{w,k}:\mathcal{V}\rightarrow\mathcal{V}\otimes\mathcal{W}$ such that

\Phi_{w}(\rho_{V|w})=\sum_{k}E_{w,k}\rho_{V|w}E_{w,k}^{\dagger},\qquad\sum_{k}E_{w,k}^{\dagger}E_{w,k}=\mathbb{I}_{V}.

We can write

\Phi(\rho)_{X|v,w}=\Phi^{\prime}(\rho)_{X|v,w}=(\mathbb{I}_{X}\otimes\langle v,w|)\Phi^{\prime}(\rho)(\mathbb{I}_{X}\otimes|v,w\rangle)
=\sum_{w^{\prime}\in\mathcal{W}}\sum_{k}(\mathbb{I}_{X}\otimes\langle v,w|E_{w^{\prime},k})\rho_{XV|w^{\prime}}(\mathbb{I}_{X}\otimes E_{w^{\prime},k}^{\dagger}|v,w\rangle)
=\sum_{w^{\prime}\in\mathcal{W}}\sum_{k}\big{\|}E_{w^{\prime},k}^{\dagger}|v,w\rangle\big{\|}_{2}^{2}\cdot\rho_{X|v^{\prime},w^{\prime}}

where in each term of the summation, $|v^{\prime}\rangle\sim E_{w^{\prime},k}^{\dagger}|v,w\rangle$. Similar to the argument in Claim 3.2, $P^{\Phi(\rho)}_{X|v,w}$ is a convex combination of $P^{\rho}_{X|v^{\prime},w^{\prime}}$. ∎
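The following sketch (an illustrative construction of ours, not the paper's) mirrors the proof: it measures the classical register $W$ in the computational basis and then applies a Kraus channel $E_{w,k}:\mathcal{V}\to\mathcal{V}\otimes\mathcal{W}$ controlled by the outcome $w$, which keeps $W$ classical and preserves the trace.

```python
import numpy as np

# Channel Phi' above: measure W in the computational basis, then apply a Kraus
# channel E_{w,k}: V -> V (x) W controlled by the outcome w.
rng = np.random.default_rng(8)
dV, dW = 2, 2

def random_kraus(d_in, d_out, num):
    # Build Kraus operators with sum_k E_k^dagger E_k = I via a random isometry.
    A = rng.normal(size=(num * d_out, d_in)) + 1j * rng.normal(size=(num * d_out, d_in))
    Q, _ = np.linalg.qr(A)                        # isometry of shape (num*d_out, d_in)
    return [Q[k * d_out:(k + 1) * d_out] for k in range(num)]

kraus = {w: random_kraus(dV, dV * dW, 2) for w in range(dW)}   # E_{w,k}

def apply_channel(rho_VW):
    out = np.zeros((dV * dW, dV * dW), dtype=complex)
    for w in range(dW):
        # rho_{V|w}: block of rho_VW with W conditioned on |w><w|
        idx = [v * dW + w for v in range(dV)]
        rho_Vw = rho_VW[np.ix_(idx, idx)]
        for E in kraus[w]:
            out += E @ rho_Vw @ E.conj().T
    return out

rho = np.kron(np.eye(dV) / dV, np.diag([0.7, 0.3]))   # a classical-quantum input
print(np.trace(apply_channel(rho)))                    # trace is preserved (~1)
```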

3.5 Branching Program with Hybrid Memory

For a learning problem that corresponds to the matrix $M$, a branching program with hybrid memory, with $m$-bit classical memory, $q$-qubit quantum memory and length $T$, is specified as follows.

At each stage $0\leq t\leq T$, the memory state of the branching program is described by a classical-quantum system $\rho^{(t)}_{VW}$ over the quantum memory space $\mathcal{V}=(\mathbb{C}^{2})^{\otimes q}$ and classical memory space $\mathcal{W}=\{0,1\}^{m}$. The memory state evolves based on the samples that the branching program receives, and therefore depends on the unknown element $x\in_{R}\mathcal{X}$. We can then interpret the overall system over $XVW$, in which $X$ consists of the unknown concept $x$, as a classical-quantum system $\rho^{(t)}_{XVW}$. It always holds that the distribution of $x$ is uniform, i.e.,

\rho^{(t)}_{X}=\mathrm{Tr}_{VW}[\rho^{(t)}_{XVW}]=\frac{1}{2^{n}}\mathbb{I}_{X}.

Initially, the memory $VW$ is independent of $X$ and can be arbitrarily initialized. We assume that

\rho^{(0)}_{XVW}=\frac{1}{2^{n}}\mathbb{I}_{X}\otimes\frac{1}{2^{q}}\mathbb{I}_{V}\otimes\frac{1}{2^{m}}\mathbb{I}_{W}.

At each stage $0\leq t<T$, the branching program receives a sample $(a,b)$, where $a\in_{R}\mathcal{A}$ and $b=M(a,x)$, and applies an operation $\Phi_{t,a,b}\in\mathscr{T}_{\mathcal{V}\otimes\mathcal{W}}$ to its memory state. Thus the evolution of the entire system can be written as

\rho^{(t+1)}_{XVW}=\mathop{\mathbf{E}}\limits_{a\in_{R}\mathcal{A}}\left[\sum_{x\in\mathcal{X}}|x\rangle\langle x|\otimes\Phi_{t,a,M(a,x)}\big{(}\rho^{(t)}_{VW|x}\big{)}\right].

Finally, at stage $t=T$, a measurement over the computational basis is applied to $\rho^{(T)}_{VW}$, and the branching program outputs an element $\widetilde{x}\in X$ as a function of the measurement result $(v,w)\in\{0,1\}^{q+m}$. The success probability of the program is the probability that $\widetilde{x}=x$, which can be formulated as

\sum_{\begin{subarray}{c}x\in\mathcal{X},v\in\{0,1\}^{q},w\in\mathcal{W}\\ \widetilde{x}(v,w)=x\end{subarray}}\langle x,v,w|\rho^{(T)}_{XVW}|x,v,w\rangle.
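To make the model concrete, here is a hedged toy simulation of one evolution step $\rho^{(t)}\mapsto\rho^{(t+1)}$ for parity learning with tiny parameters; the channel $\Phi_{t,a,b}$ used below is an arbitrary placeholder of ours (a rotation on the qubit plus a classical update), chosen only to show the bookkeeping.

```python
import numpy as np

# Toy simulation (illustrative, not from the paper) of one step of a hybrid
# branching program: n = 2 concept bits, q = 1 qubit, m = 1 classical bit.
n, dV, dW = 2, 2, 2
N, dM = 2 ** n, 2 * 2   # number of concepts, dimension of V (x) W

def phi(a, b, rho_vw):  # a placeholder channel in T_{V (x) W}
    out = np.zeros_like(rho_vw)
    theta = 0.3 * (1 + 2 * a[0] + b)
    U = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
    for w in range(dW):
        idx = [v * dW + w for v in range(dV)]
        rho_Vw = rho_vw[np.ix_(idx, idx)]
        new_w = w ^ ((b + 1) // 2)          # store b (mapped from {-1,1} to {0,1})
        new_idx = [v * dW + new_w for v in range(dV)]
        out[np.ix_(new_idx, new_idx)] += U @ rho_Vw @ U.T
    return out

# rho^{(t)}_{XVW}: for each x, a partial density operator on V (x) W, trace 2^{-n}.
rho = {x: np.eye(dM) / (dM * N) for x in range(N)}

new_rho = {x: np.zeros((dM, dM)) for x in range(N)}
for a_int in range(N):                      # expectation over uniformly random a
    a = [(a_int >> j) & 1 for j in range(n)]
    for x in range(N):
        b = (-1) ** (sum(ai * ((x >> j) & 1) for j, ai in enumerate(a)) % 2)
        new_rho[x] += phi(a, b, rho[x]) / N
print(sum(np.trace(new_rho[x]).real for x in range(N)))   # total trace stays 1
```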

4 Main Result

Theorem 2.

Let $\mathcal{X},\mathcal{A}$ be two finite sets with $n=\log_{2}|\mathcal{X}|$. Let $M:\mathcal{A}\times\mathcal{X}\rightarrow\{-1,1\}$ be a matrix which is a $(k^{\prime},\ell^{\prime})$-$L_{2}$ extractor with error $2^{-r^{\prime}}$ for sufficiently large $k^{\prime},\ell^{\prime}$ and $r^{\prime}$, where $\ell^{\prime}\leq n$. Let

r=\min\left\{\frac{1}{4}r^{\prime},\frac{1}{26}\ell^{\prime}+\frac{1}{6},\frac{1}{2}(k^{\prime}-1)\right\}.

Let $\rho$ be a branching program for the learning problem corresponding to $M$, described by classical-quantum systems $\rho^{(t)}_{XVW}$, with $q$-qubit quantum memory $V$, $m$-bit classical memory $W$ and length $T$. If $m\leq\frac{1}{44}(k^{\prime}-1)\ell^{\prime}$, $q\leq r-7$ and $T\leq 2^{r-2}$, then the success probability of $\rho$ is at most $O(2^{q-r})$.

From now on we let $k=k^{\prime}-1$ and $\ell=\frac{1}{5}(\ell^{\prime}-13r-2)$. Then we have the following inequalities, to be used later:

q+r+1-r^{\prime}\ \leq\ -2r. (4)
2\ell+9r-n\ \leq\ -r. (5)
(k-r)\ell\ \geq\ 2m+4r+1. (6)

We defer the detailed calculation of the above inequalities to Appendix A.

4.1 Truncated Classical-Quantum Systems

Here we describe how to truncate a partial classical-quantum system $\rho_{XVW}$ according to some desired property $G(v,w)$ of $\rho_{X|v,w}$. The goal is to remove the parts of $\rho_{XVW}$ where $G$ is not satisfied. We execute the following procedure:

  1. Maintain a partial system $\rho^{\prime}_{XVW}$ initialized as $\rho_{XVW}$, and subspaces $\mathcal{V}_{w}\subseteq\mathcal{V}$ initialized as $\mathcal{V}$ for each $w\in\mathcal{W}$.

  2. Pick $w\in\mathcal{W}$ and $|v\rangle\in\mathcal{V}_{w}$ such that $\mathrm{Tr}[\rho^{\prime}_{X|v,w}]>0$ and $G(v,w)$ is false.

  3. Change the partial system $\rho^{\prime}_{XVW}$ into the following system by projection:

     \big{(}\mathbb{I}_{X}\otimes(\mathbb{I}_{VW}-|v,w\rangle\langle v,w|)\big{)}\rho^{\prime}_{XVW}\big{(}\mathbb{I}_{X}\otimes(\mathbb{I}_{VW}-|v,w\rangle\langle v,w|)\big{)},

     and change $\mathcal{V}_{w}$ to its subspace orthogonal to $|v\rangle$, that is,

     \{|v^{\prime}\rangle\in\mathcal{V}_{w}\mid\langle v|v^{\prime}\rangle=0\}.
  4. Repeat from step 2 until there is no such $w$ and $|v\rangle$. Denote the final system as $\rho^{|G}_{XVW}$.

In step 2 we pick $w$ and $|v\rangle$ arbitrarily as long as they satisfy the requirements; however, we can always think of it as iterating over $w\in\mathcal{W}$ and processing each $\rho_{XV|w}$ separately. The choices of $|v\rangle$ for each $w$ do affect the final system $\rho^{|G}_{XVW}$; yet as we will see later, these choices are irrelevant to our proof.
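A hedged sketch of this procedure for a single classical memory state $w$ (the property $G$ below is a placeholder $\ell_{2}$-norm threshold of ours, and the search over $\mathcal{V}_{w}$ is simplified to an eigenbasis of the remaining subspace):

```python
import numpy as np

# Truncation sketch for one w: repeatedly project out directions |v> where the
# placeholder property G (an l2-norm threshold on P_{X|v,w}) fails.
rng = np.random.default_rng(9)
dX, dV = 16, 4

A = rng.normal(size=(dX * dV, dX * dV))
rho = A @ A.T; rho /= np.trace(rho)              # rho_{XV|w}; X is the first factor

def conditional(rho, v):
    # (I_X (x) <v|) rho (I_X (x) |v>), a partial operator on X
    proj = np.kron(np.eye(dX), v.reshape(1, dV))
    return proj @ rho @ proj.T

def violating_direction(rho, Q):
    # Search an eigenbasis of the remaining subspace (projector Q) for a bad |v>.
    for v in np.linalg.eigh(Q)[1].T[::-1][:int(round(np.trace(Q)))]:
        cond = conditional(rho, v)
        t = np.trace(cond)
        if t > 1e-12 and np.linalg.norm(np.diag(cond) / t) > 1.2 / np.sqrt(dX):
            return v
    return None

Q = np.eye(dV)                                   # projector onto V_w, initially all of V
while (v := violating_direction(rho, Q)) is not None:
    Pi = np.kron(np.eye(dX), np.eye(dV) - np.outer(v, v))
    rho = Pi @ rho @ Pi                          # remove the direction |v, w>
    Q = Q - np.outer(v, v)                       # shrink the remaining subspace
print("remaining trace after truncation:", np.trace(rho))
```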

Below, we give two useful lemmas on truncated systems.

Lemma 4.1.

For every $|v\rangle\in\mathcal{V}$ and $w\in\mathcal{W}$ such that $\mathrm{Tr}[\rho^{|G}_{X|v,w}]>0$, there exists $|v^{\prime}\rangle$ in the remaining subspace $\mathcal{V}_{w}$ such that

P^{\rho^{|G}}_{X|v,w}=P^{\rho}_{X|v^{\prime},w}=P^{\rho^{|G}}_{X|v^{\prime},w}.
Proof.

It suffices to prove the lemma for one round of the truncation procedure. Suppose $|v_{1},w_{1}\rangle$ is picked in step 2, resulting in the partial system

\rho^{\prime}_{XVW}=\big{(}\mathbb{I}_{X}\otimes(\mathbb{I}_{VW}-|v_{1},w_{1}\rangle\langle v_{1},w_{1}|)\big{)}\rho_{XVW}\big{(}\mathbb{I}_{X}\otimes(\mathbb{I}_{VW}-|v_{1},w_{1}\rangle\langle v_{1},w_{1}|)\big{)}.

We can write

\rho^{\prime}_{X|v,w}=\big{(}\mathbb{I}_{X}\otimes\langle v,w|\big{)}\rho^{\prime}_{XVW}\big{(}\mathbb{I}_{X}\otimes|v,w\rangle\big{)}
=\big{(}\mathbb{I}_{X}\otimes(\langle v,w|-\langle v,w|v_{1},w_{1}\rangle\langle v_{1},w_{1}|)\big{)}\rho_{XVW}\big{(}\mathbb{I}_{X}\otimes(|v,w\rangle-|v_{1},w_{1}\rangle\langle v_{1},w_{1}|v,w\rangle)\big{)}.
  • If $w\neq w_{1}$, then

    \rho^{\prime}_{X|v,w}=\big{(}\mathbb{I}_{X}\otimes\langle v,w|\big{)}\rho_{XVW}\big{(}\mathbb{I}_{X}\otimes|v,w\rangle\big{)}=\rho_{X|v,w}.

    And the lemma holds directly by choosing $|v^{\prime}\rangle=|v\rangle$.

  • If $w=w_{1}$, then with $\langle v_{1},w_{1}|v,w\rangle=\langle v_{1}|v\rangle=\lambda$, we have

    \rho^{\prime}_{X|v,w}=\big{(}\mathbb{I}_{X}\otimes(\langle v|-\overline{\lambda}\langle v_{1}|)\langle w|\big{)}\rho_{XVW}\big{(}\mathbb{I}_{X}\otimes(|v\rangle-\lambda|v_{1}\rangle)|w\rangle\big{)}.

    By the fact that $\mathrm{Tr}[\rho^{|G}_{X|v,w}]>0$, we must have $|v\rangle\neq|v_{1}\rangle$. Therefore if we let $|v^{\prime}\rangle\sim|v\rangle-\lambda|v_{1}\rangle$, which is the normalized projection of $|v\rangle$ onto the orthogonal subspace of $|v_{1}\rangle$, the above equality implies that $P^{\rho^{\prime}}_{X|v,w}=P^{\rho}_{X|v^{\prime},w}$. Meanwhile, since $\langle v_{1}|v^{\prime}\rangle=0$ we have $\rho^{\prime}_{X|v^{\prime},w}=\rho_{X|v^{\prime},w}$, which completes the proof. ∎

A direct corollary of the above lemma is that if $G(v,w)$ only depends on the distribution $P^{\rho}_{X|v,w}$, then $G(v,w)$ holds for every $|v\rangle\in\mathcal{V}$ and $w\in\mathcal{W}$ in the truncated system $\rho^{|G}_{XVW}$, even when $|v\rangle$ is not in the remaining subspace $\mathcal{V}_{w}$.

Our second lemma is based on the following fact, which bounds the trace distance between a partial system and its projection; the proof can be found in Appendix B.

Proposition 4.2.

For every partial system $\rho$ and projection operator $\Pi$ on $\rho$, we have

\big{\|}\rho-\Pi\rho\Pi\big{\|}^{2}_{\mathrm{Tr}}\leq 4\mathrm{Tr}[\rho]^{2}-4\mathrm{Tr}[\Pi\rho]^{2}.
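A hedged numerical check of Proposition 4.2 on random partial density operators and random projections (a sanity check, not a proof):

```python
import numpy as np

# For random partial density operators rho and random projections Pi, verify
# ||rho - Pi rho Pi||_Tr^2 <= 4 Tr[rho]^2 - 4 Tr[Pi rho]^2.
rng = np.random.default_rng(10)
d = 8
for _ in range(1000):
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = A @ A.conj().T
    rho *= rng.uniform(0.1, 1.0) / np.trace(rho).real   # partial: trace <= 1
    Q, _ = np.linalg.qr(rng.normal(size=(d, rng.integers(1, d))))
    Pi = Q @ Q.conj().T                                  # a random projection
    lhs = np.sum(np.linalg.svd(rho - Pi @ rho @ Pi, compute_uv=False)) ** 2
    rhs = 4 * np.trace(rho).real ** 2 - 4 * np.trace(Pi @ rho).real ** 2
    assert lhs <= rhs + 1e-9
print("Proposition 4.2 holds on all sampled instances")
```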
Lemma 4.3.

For each $w\in\mathcal{W}$, let $|v_{1}\rangle,\ldots,|v_{d}\rangle$ be the states picked in step 2 within $\mathcal{V}_{w}$. Then

\big{\|}\rho_{XV|w}-\rho^{|G}_{XV|w}\big{\|}_{\mathrm{Tr}}\leq 3\sum_{i=1}^{d}\sqrt{\mathrm{Tr}[\rho_{X|v_{i},w}]\mathrm{Tr}[\rho_{XV|w}]}.
Proof.

In Proposition 4.2, take $\rho$ to be $\rho_{XV|w}$, and $\Pi$ to be

\mathbb{I}_{X}\otimes\prod_{i=1}^{d}\left(\mathbb{I}_{V}-|v_{i}\rangle\langle v_{i}|\right)=\mathbb{I}_{X}\otimes\left(\mathbb{I}_{V}-\sum_{i=1}^{d}|v_{i}\rangle\langle v_{i}|\right).

Then $\Pi\rho\Pi=\rho^{|G}_{XV|w}$ and $\mathrm{Tr}[\Pi\rho]=\mathrm{Tr}[\rho_{XV|w}]-\sum_{i=1}^{d}\mathrm{Tr}[\rho_{X|v_{i},w}]$. Therefore we have

\big{\|}\rho_{XV|w}-\rho^{|G}_{XV|w}\big{\|}_{\mathrm{Tr}}\leq\sqrt{4\mathrm{Tr}[\rho]^{2}-4\mathrm{Tr}[\Pi\rho]^{2}}
\leq\sqrt{8(\mathrm{Tr}[\rho]-\mathrm{Tr}[\Pi\rho])\mathrm{Tr}[\rho]}
=\sqrt{8\sum_{i=1}^{d}\mathrm{Tr}[\rho_{X|v_{i},w}]\mathrm{Tr}[\rho_{XV|w}]}
\leq 3\sum_{i=1}^{d}\sqrt{\mathrm{Tr}[\rho_{X|v_{i},w}]\mathrm{Tr}[\rho_{XV|w}]}.\qed

Since $\mathrm{Tr}[\rho_{XV|w}]\leq 1$ always holds, by summing over all $w\in\mathcal{W}$ we get the following corollary:

Corollary 4.4.

Let $|v_{1},w_{1}\rangle,\ldots,|v_{d},w_{d}\rangle$ be all of the memory states picked in step 2. Then

\big{\|}\rho_{XVW}-\rho^{|G}_{XVW}\big{\|}_{\mathrm{Tr}}\leq 3\sum_{i=1}^{d}\sqrt{\mathrm{Tr}[\rho_{X|v_{i},w_{i}}]}.

4.2 Truncated Branching Program

The properties that we desire for the partial system $\rho_{XVW}$ consist of three parts:

  • Small $L_{2}$ norm: Let $G_{2}(v,w)$ be the property that

    \big{\|}P^{\rho}_{X|v,w}\big{\|}_{2}\leq 2^{\ell}\cdot 2^{-n/2}.

  • Small $L_{\infty}$ norm: Let $G_{\infty}(v,w)$ be the property that

    \big{\|}P^{\rho}_{X|v,w}\big{\|}_{\infty}\leq 2^{2\ell+9r}\cdot 2^{-n}.

  • Even division: For every $a\in\mathcal{A}$, let $G_{a}(v,w)$ be the property that

    |\langle M_{a},P^{\rho}_{X|v,w}\rangle|\leq 2^{-r}.

Now we define the truncated branching program, by specifying the truncated partial classical-quantum system $\tau^{(t)}_{XVW}$ for each stage $t$. Initially let $\tau^{(0)}_{XVW}=\rho^{(0)}_{XVW}$. For each stage $0\leq t\leq T$, the truncation consists of three ingredients (below we ignore the superscripts on $P$ for convenience):

  1. Remove parts where $\big{\|}P_{X|v,w}\big{\|}_{2}$ is large. That is, let $\tau^{(t,\star)}_{XVW}=\tau^{(t)|G_{2}}_{XVW}$.

  2. 2.

    Remove parts where PX|v,w\big{\|}P_{X|v,w}\big{\|}_{\infty} is large. This is done by two steps.

    • -

      First, let g{0,1}𝒳𝒲g\in\{0,1\}^{\mathcal{X}\otimes\mathcal{W}} be an indicator vector such that g(x,w)=1g(x,w)=1 if and only if

      Tr[τX|w(t,)]>0 and PX|wτ(t,)(x)22+5r2n.\mathrm{Tr}[\tau^{(t,\star)}_{X|w}]>0\textrm{ and }P^{\tau^{(t,\star)}}_{X|w}(x)\leq 2^{2\ell+5r}\cdot 2^{-n}.

      Let τXVW(t,)=(gg𝕀V)τXVW(t,)(gg𝕀V)\tau^{(t,\circ)}_{XVW}=(gg^{\dagger}\otimes\mathbb{I}_{V})\tau^{(t,\star)}_{XVW}(gg^{\dagger}\otimes\mathbb{I}_{V}), where gggg^{\dagger} is the projection operator acting on 𝒳𝒲\mathcal{X}\otimes\mathcal{W}.

    •

      To make sure that the distributions did not change a lot after the projection gggg^{\dagger}, for each 0t<T0\leq t<T, let Gt(v,w)G_{t}(v,w) be the property that

      Tr[τX|v,w(t,)](12r)Tr[τX|v,w(t,)].\mathrm{Tr}[\tau^{(t,\circ)}_{X|v,w}]\geq(1-2^{-r})\mathrm{Tr}[\tau^{(t,\star)}_{X|v,w}].

      Let τXVW(t,)=τXVW(t,)|GGt\tau^{(t,\infty)}_{XVW}=\tau^{(t,\circ)|G_{\infty}\wedge G_{t}}_{XVW}.

  3.

    For each a𝒜a\in\mathcal{A}, remove (only for this aa) parts where PX|v,wP_{X|v,w} is not evenly divided by aa. That is, for each a𝒜a\in\mathcal{A}, let τXVW(t,a)=τXVW(t,)|Ga\tau^{(t,a)}_{XVW}=\tau^{(t,\infty)|G_{a}}_{XVW}.

Then, if t<Tt<T, for each aR𝒜a\in_{R}\mathcal{A} we evolve the system by applying the sample operations Φt,a,b\Phi_{t,a,b} as the original branching program on τXVW(t,a)\tau^{(t,a)}_{XVW}, so that we have

τXVW(t+1)=𝐄aR𝒜[x𝒳|xx|Φt,a,M(a,x)(τVW|x(t,a))].\tau^{(t+1)}_{XVW}=\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\left[\sum_{x\in\mathcal{X}}|x\rangle\langle x|\otimes\Phi_{t,a,M(a,x)}\big{(}\tau^{(t,a)}_{VW|x}\big{)}\right].

4.3 Bounding the Truncation Difference

In order to show that the success probability of the original branching program ρ(t)\rho^{(t)} is low, the plan is to prove an upper bound on the success probability of the truncated branching program τ(t)\tau^{(t)}, and bound the difference between the two probabilities.

Here we bound the difference by the trace distance between the two systems ρXVW(t)\rho^{(t)}_{XVW} and τXVW(t)\tau^{(t)}_{XVW}. We will show that the contribution to the trace distance from each one of the truncation ingredients is small, and in addition the evolution preserves the trace distance.

4.3.1 Truncation by G2G_{2}

Lemma 4.5.

For every 0tT0\leq t\leq T, |v𝒱|v\rangle\in\mathcal{V} and w𝒲w\in\mathcal{W} such that G2(v,w)G_{2}(v,w) is violated (that is, PX|v,wτ(t)2>22n/2\big{\|}P^{\tau^{(t)}}_{X|v,w}\big{\|}_{2}>2^{\ell}\cdot 2^{-n/2}), we must have Tr[τX|v,w(t)]<22m24r\mathrm{Tr}[\tau^{(t)}_{X|v,w}]<2^{-2m}\cdot 2^{-4r}.

The lemma says that any direction |v,w|v,w\rangle picked by the truncation procedure has small weight, so the truncation does not change the state significantly.

Proof.

This is our main technical lemma and we defer the proof to Section 5. ∎

Since there are at most 2q+m2^{q+m} such directions picked in the truncation procedure, we conclude the following corollary.

Corollary 4.6.

For every 0tT0\leq t\leq T, we have τXVW(t,)τXVW(t)Tr32q2r\big{\|}\tau^{(t,\star)}_{XVW}-\tau^{(t)}_{XVW}\big{\|}_{\mathrm{Tr}}\leq 3\cdot 2^{q-2r}.

Proof.

Recall that τXVW(t,)=τXVW(t)|G2\tau^{(t,\star)}_{XVW}=\tau^{(t)|G_{2}}_{XVW}. Since dim(𝒱𝒲)=2q+m\dim(\mathcal{V}\otimes\mathcal{W})=2^{q+m}, the truncation lasts for at most 2q+m2^{q+m} rounds. Since in every round the picked |v,w|v,w\rangle has Tr[τX|v,w(t)]<22m24r\mathrm{Tr}[\tau^{(t)}_{X|v,w}]<2^{-2m}\cdot 2^{-4r}, by Corollary 4.4 we have

τXVW(t,)τXVW(t)Tr32q+m22m24r=32q2r.\big{\|}\tau^{(t,\star)}_{XVW}-\tau^{(t)}_{XVW}\big{\|}_{\mathrm{Tr}}\leq 3\cdot 2^{q+m}\cdot\sqrt{2^{-2m}\cdot 2^{-4r}}=3\cdot 2^{q-2r}.\qed

4.3.2 Truncation by GG_{\infty}

Lemma 4.7.

For every 0tT0\leq t\leq T and w𝒲w\in\mathcal{W} we have

x𝒳g(x,w)=0PX|wτ(t,)(x)25r.\sum_{\begin{subarray}{c}x\in\mathcal{X}\\ g(x,w)=0\end{subarray}}P^{\tau^{(t,\star)}}_{X|w}(x)\leq 2^{-5r}.
Proof.

By 3.2, PX|wτ(t,)P^{\tau^{(t,\star)}}_{X|w} is a convex combination of PX|v,wτ(t,)P^{\tau^{(t,\star)}}_{X|v,w}. From Lemma 4.1 we know that G2(PX|v,wτ(t,))G_{2}(P^{\tau^{(t,\star)}}_{X|v,w}) holds for every |v|v\rangle and ww, and thus by convexity of 2\ell_{2}-norms we know that G2(PX|wτ(t,))G_{2}(P^{\tau^{(t,\star)}}_{X|w}) also holds. That means

𝐄xPX|wτ(t,)[PX|wτ(t,)(x)]=PX|wτ(t,)22222n.\mathop{\mathbf{E}\mbox{}}\limits_{x\sim P^{\tau^{(t,\star)}}_{X|w}}\left[P^{\tau^{(t,\star)}}_{X|w}(x)\right]=\big{\|}P^{\tau^{(t,\star)}}_{X|w}\big{\|}^{2}_{2}\leq 2^{2\ell}\cdot 2^{-n}.

Therefore, by Markov’s inequality we have

x𝒳g(x,w)=0PX|wτ(t,)(x)=PrxPX|wτ(t,)[PX|wτ(t,)(x)>22+5r2n]25r.\sum_{\begin{subarray}{c}x\in\mathcal{X}\\ g(x,w)=0\end{subarray}}P^{\tau^{(t,\star)}}_{X|w}(x)=\Pr_{x\sim P^{\tau^{(t,\star)}}_{X|w}}\left[P^{\tau^{(t,\star)}}_{X|w}(x)>2^{2\ell+5r}\cdot 2^{-n}\right]\leq 2^{-5r}.\qed
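Spelled out, the Markov step is applied to the non-negative random variable P^{τ(t,⋆)}_{X|w}(x) with x drawn from P^{τ(t,⋆)}_{X|w}:

\Pr_{x\sim P^{\tau^{(t,\star)}}_{X|w}}\left[P^{\tau^{(t,\star)}}_{X|w}(x)>2^{2\ell+5r}\cdot 2^{-n}\right]\leq\frac{\mathbf{E}_{x\sim P^{\tau^{(t,\star)}}_{X|w}}\big[P^{\tau^{(t,\star)}}_{X|w}(x)\big]}{2^{2\ell+5r}\cdot 2^{-n}}\leq\frac{2^{2\ell}\cdot 2^{-n}}{2^{2\ell+5r}\cdot 2^{-n}}=2^{-5r}.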
Corollary 4.8.

For every 0tT0\leq t\leq T and every w𝒲w\in\mathcal{W}, we have τXV|w(t,)τXV|w(t,)\tau^{(t,\circ)}_{XV|w}\leq\tau^{(t,\star)}_{XV|w}, and

Tr[τXV|w(t,)](125r)Tr[τXV|w(t,)].\mathrm{Tr}[\tau^{(t,\circ)}_{XV|w}]\geq(1-2^{-5r})\cdot\mathrm{Tr}[\tau^{(t,\star)}_{XV|w}].

Moreover, we have τXVW(t,)τXVW(t,)Tr25r\big{\|}\tau^{(t,\circ)}_{XVW}-\tau^{(t,\star)}_{XVW}\big{\|}_{\mathrm{Tr}}\leq 2^{-5r}.

Proof.

Since XX and WW are both classical and τXVW(t,)=(gg𝕀V)τXVW(t,)(gg𝕀V)\tau^{(t,\circ)}_{XVW}=(gg^{\dagger}\otimes\mathbb{I}_{V})\tau^{(t,\star)}_{XVW}(gg^{\dagger}\otimes\mathbb{I}_{V}), we have

τXV|w(t,)τXV|w(t,)=x𝒳g(x,w)=0|xx|τV|x,w(t,),\tau^{(t,\star)}_{XV|w}-\tau^{(t,\circ)}_{XV|w}=\sum_{\begin{subarray}{c}x\in\mathcal{X}\\ g(x,w)=0\end{subarray}}|x\rangle\langle x|\otimes\tau^{(t,\star)}_{V|x,w},

which is positive semi-definite. Recalling that (Equation 3)

Tr[τV|x,w(t,)]=x,w|τXW(t,)|x,w=diagτX|w(t,)(x)=PX|wτ(t,)(x)Tr[τX|w(t,)],\mathrm{Tr}[\tau^{(t,\star)}_{V|x,w}]=\langle x,w|\tau^{(t,\star)}_{XW}|x,w\rangle=\operatorname{diag}\tau^{(t,\star)}_{X|w}(x)=P^{\tau^{(t,\star)}}_{X|w}(x)\mathrm{Tr}[\tau^{(t,\star)}_{X|w}],

we have

Tr[τXV|w(t,)]Tr[τXV|w(t,)]=x𝒳g(x,w)=0Tr[τV|x,w(t,)]=x𝒳g(x,w)=0PX|wτ(t,)(x)Tr[τX|w(t,)]25rTr[τX|w(t,)].\mathrm{Tr}[\tau^{(t,\star)}_{XV|w}]-\mathrm{Tr}[\tau^{(t,\circ)}_{XV|w}]=\sum_{\begin{subarray}{c}x\in\mathcal{X}\\ g(x,w)=0\end{subarray}}\mathrm{Tr}[\tau^{(t,\star)}_{V|x,w}]=\sum_{\begin{subarray}{c}x\in\mathcal{X}\\ g(x,w)=0\end{subarray}}P^{\tau^{(t,\star)}}_{X|w}(x)\cdot\mathrm{Tr}[\tau^{(t,\star)}_{X|w}]\leq 2^{-5r}\cdot\mathrm{Tr}[\tau^{(t,\star)}_{X|w}].

Therefore, as τXVW(t,)τXVW(t,)\tau^{(t,\star)}_{XVW}-\tau^{(t,\circ)}_{XVW} is positive semi-definite, we have

τXVW(t,)τXVW(t,)Tr=w𝒲Tr[τXV|w(t,)]Tr[τXV|w(t,)]25rw𝒲Tr[τX|w(t,)]25r.\big{\|}\tau^{(t,\circ)}_{XVW}-\tau^{(t,\star)}_{XVW}\big{\|}_{\mathrm{Tr}}=\sum_{w\in\mathcal{W}}\mathrm{Tr}[\tau^{(t,\star)}_{XV|w}]-\mathrm{Tr}[\tau^{(t,\circ)}_{XV|w}]\leq 2^{-5r}\sum_{w\in\mathcal{W}}\mathrm{Tr}[\tau^{(t,\star)}_{X|w}]\leq 2^{-5r}.\qed
Lemma 4.9.

For every 0tT0\leq t\leq T, |v𝒱|v\rangle\in\mathcal{V} and w𝒲w\in\mathcal{W} such that G(v,w)G_{\infty}(v,w) is violated (that is, PX|v,wτ(t,)>22+9r2n\big{\|}P^{\tau^{(t,\circ)}}_{X|v,w}\big{\|}_{\infty}>2^{2\ell+9r}\cdot 2^{-n}) or Gt(v,w)G_{t}(v,w) is violated (that is, Tr[τX|v,w(t,)]<(12r)Tr[τX|v,w(t,)]\mathrm{Tr}[\tau^{(t,\circ)}_{X|v,w}]<(1-2^{-r})\mathrm{Tr}[\tau^{(t,\star)}_{X|v,w}]), we must have Tr[τX|v,w(t,)]<224rTr[τX|w(t,)]\mathrm{Tr}[\tau^{(t,\circ)}_{X|v,w}]<2\cdot 2^{-4r}\cdot\mathrm{Tr}[\tau^{(t,\circ)}_{X|w}].

Proof.

If G(v,w)G_{\infty}(v,w) is violated, let x𝒳x\in\mathcal{X} be the one such that PX|v,wτ(t,)(x)>22+9r2nP^{\tau^{(t,\circ)}}_{X|v,w}(x)>2^{2\ell+9r}\cdot 2^{-n}. If g(x,w)=0g(x,w)=0 then PX|wτ(t,)(x)=0P^{\tau^{(t,\circ)}}_{X|w}(x)=0, while if g(x,w)=1g(x,w)=1 then by Corollary 4.8,

PX|wτ(t,)(x)Tr[τX|w(t,)]Tr[τX|w(t,)]22+5r2n(125r)122+5r2n.P^{\tau^{(t,\circ)}}_{X|w}(x)\leq\frac{\mathrm{Tr}[\tau^{(t,\star)}_{X|w}]}{\mathrm{Tr}[\tau^{(t,\circ)}_{X|w}]}\cdot 2^{2\ell+5r}\cdot 2^{-n}\leq(1-2^{-5r})^{-1}\cdot 2^{2\ell+5r}\cdot 2^{-n}.

Hence we always have

Tr[τX|v,w(t,)]PX|wτ(t,)(x)PX|v,wτ(t,)(x)Tr[τX|w(t,)]224rTr[τX|w(t,)],\mathrm{Tr}[\tau^{(t,\circ)}_{X|v,w}]\leq\frac{P^{\tau^{(t,\circ)}}_{X|w}(x)}{P^{\tau^{(t,\circ)}}_{X|v,w}(x)}\cdot\mathrm{Tr}[\tau^{(t,\circ)}_{X|w}]\leq 2\cdot 2^{-4r}\cdot\mathrm{Tr}[\tau^{(t,\circ)}_{X|w}],

where the first inequality comes from the fact that τX|w(t,)τX|v,w(t,)\tau^{(t,\circ)}_{X|w}\geq\tau^{(t,\circ)}_{X|v,w} and Equation 3.

If Gt(v,w)G_{t}(v,w) is violated, since we know from Corollary 4.8 that

|Tr[τX|v,w(t,)]Tr[τX|v,w(t,)]|\displaystyle\left|\mathrm{Tr}[\tau^{(t,\circ)}_{X|v,w}]-\mathrm{Tr}[\tau^{(t,\star)}_{X|v,w}]\right| τXV|w(t,)τXV|w(t,)Tr25rTr[τXV|w(t,)]\displaystyle\leq\big{\|}\tau^{(t,\circ)}_{XV|w}-\tau^{(t,\star)}_{XV|w}\big{\|}_{\mathrm{Tr}}\leq 2^{-5r}\cdot\mathrm{Tr}[\tau^{(t,\star)}_{XV|w}]
25r(125r)1Tr[τXV|w(t,)],\displaystyle\leq 2^{-5r}\cdot(1-2^{-5r})^{-1}\cdot\mathrm{Tr}[\tau^{(t,\circ)}_{XV|w}],

therefore from Tr[τX|v,w(t,)]<(12r)Tr[τX|v,w(t,)]\mathrm{Tr}[\tau^{(t,\circ)}_{X|v,w}]<(1-2^{-r})\mathrm{Tr}[\tau^{(t,\star)}_{X|v,w}] we deduce that

Tr[τX|v,w(t,)]\displaystyle\mathrm{Tr}[\tau^{(t,\circ)}_{X|v,w}] <(2r1)(Tr[τX|v,w(t,)]Tr[τX|v,w(t,)])\displaystyle<(2^{r}-1)\cdot\left(\mathrm{Tr}[\tau^{(t,\star)}_{X|v,w}]-\mathrm{Tr}[\tau^{(t,\circ)}_{X|v,w}]\right)
(2r1)25r(125r)1Tr[τXV|w(t,)]\displaystyle\leq(2^{r}-1)\cdot 2^{-5r}\cdot(1-2^{-5r})^{-1}\cdot\mathrm{Tr}[\tau^{(t,\circ)}_{XV|w}]
<224rTr[τX|w(t,)].\displaystyle<2\cdot 2^{-4r}\cdot\mathrm{Tr}[\tau^{(t,\circ)}_{X|w}].\qed
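The final step above can be verified directly, assuming r ≥ 1 and using Tr[τ^(t,∘)_{XV|w}] = Tr[τ^(t,∘)_{X|w}]:

(2^{r}-1)\cdot 2^{-5r}\cdot(1-2^{-5r})^{-1}<2^{-4r}\cdot(1-2^{-5})^{-1}<2\cdot 2^{-4r}.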
Corollary 4.10.

For every 0tT0\leq t\leq T, we have τXVW(t,)τXVW(t,)Tr52q2r\big{\|}\tau^{(t,\infty)}_{XVW}-\tau^{(t,\circ)}_{XVW}\big{\|}_{\mathrm{Tr}}\leq 5\cdot 2^{q-2r}.

Proof.

Recall that τXVW(t,)=τXVW(t,)|GGt\tau^{(t,\infty)}_{XVW}=\tau^{(t,\circ)|G_{\infty}\wedge G_{t}}_{XVW}. For each w𝒲w\in\mathcal{W}, the truncation picks at most dim𝒱=2q\dim\mathcal{V}=2^{q} states |v,w|v,w\rangle, each with Tr[τX|v,w(t,)]<224rTr[τX|w(t,)]\mathrm{Tr}[\tau^{(t,\circ)}_{X|v,w}]<2\cdot 2^{-4r}\cdot\mathrm{Tr}[\tau^{(t,\circ)}_{X|w}]. Therefore by applying Lemma 4.3 for each w𝒲w\in\mathcal{W}, we have

τXVW(t,)τXVW(t,)Tr3w𝒲2q224rTr[τX|w(t,)]52q2r.\big{\|}\tau^{(t,\infty)}_{XVW}-\tau^{(t,\circ)}_{XVW}\big{\|}_{\mathrm{Tr}}\leq 3\cdot\sum_{w\in\mathcal{W}}2^{q}\cdot\sqrt{2\cdot 2^{-4r}}\cdot\mathrm{Tr}[\tau^{(t,\circ)}_{X|w}]\leq 5\cdot 2^{q-2r}.\qed

4.3.3 Truncation by GaG_{a}

Notice that in the truncation step from τ(t,)\tau^{(t,\star)} to τ(t,)\tau^{(t,\circ)}, the distribution PX|v,wτ(t,)P^{\tau^{(t,\star)}}_{X|v,w} might change and not satisfy G2G_{2} anymore. However, with the truncation by GtG_{t}, any such distribution that changes too much is eliminated, and we have the following guarantee.

Lemma 4.11.

For every 0tT0\leq t\leq T, |v𝒱|v\rangle\in\mathcal{V} and w𝒲w\in\mathcal{W}, we have

PX|v,wτ(t,)2(12r)122n/2.\big{\|}P^{\tau^{(t,\infty)}}_{X|v,w}\big{\|}_{2}\leq(1-2^{-r})^{-1}\cdot 2^{\ell}\cdot 2^{-n/2}.
Proof.

By Lemma 4.1, there exists |v𝒱|v^{\prime}\rangle\in\mathcal{V} such that PX|v,wτ(t,)=PX|v,wτ(t,)=PX|v,wτ(t,)P^{\tau^{(t,\infty)}}_{X|v,w}=P^{\tau^{(t,\infty)}}_{X|v^{\prime},w}=P^{\tau^{(t,\circ)}}_{X|v^{\prime},w}. The truncation by GtG_{t} ensures that Tr[τX|v,w(t,)](12r)Tr[τX|v,w(t,)]\mathrm{Tr}[\tau^{(t,\circ)}_{X|v^{\prime},w}]\geq(1-2^{-r})\mathrm{Tr}[\tau^{(t,\star)}_{X|v^{\prime},w}], and therefore

PX|v,wτ(t,)2=PX|v,wτ(t,)2=diagτX|v,w(t,)2Tr[τX|v,w(t,)]diagτX|v,w(t,)2(12r)Tr[τX|v,w(t,)](12r)122n/2.\big{\|}P^{\tau^{(t,\infty)}}_{X|v,w}\big{\|}_{2}=\big{\|}P^{\tau^{(t,\circ)}}_{X|v^{\prime},w}\big{\|}_{2}=\frac{\big{\|}\operatorname{diag}\tau^{(t,\circ)}_{X|v^{\prime},w}\big{\|}_{2}}{\mathrm{Tr}[\tau^{(t,\circ)}_{X|v^{\prime},w}]}\leq\frac{\big{\|}\operatorname{diag}\tau^{(t,\star)}_{X|v^{\prime},w}\big{\|}_{2}}{(1-2^{-r})\mathrm{Tr}[\tau^{(t,\star)}_{X|v^{\prime},w}]}\leq(1-2^{-r})^{-1}\cdot 2^{\ell}\cdot 2^{-n/2}.\qed
Lemma 4.12.

For every partial classical-quantum system τXV\tau_{XV} over 𝒳𝒱\mathcal{X}\otimes\mathcal{V} such that PX|vτ222n/2\big{\|}P^{\tau}_{X|v}\big{\|}_{2}\leq 2^{\ell^{\prime}}\cdot 2^{-n/2} holds for every |v𝒱|v\rangle\in\mathcal{V}, we have

PraR𝒜[|v𝒱,|Ma,PX|vτ|2r]22r.\Pr_{a\in_{R}\mathcal{A}}\left[\exists|v\rangle\in\mathcal{V},|\langle M_{a},P^{\tau}_{X|v}\rangle|\geq 2^{-r}\right]\leq 2^{-2r}.
Proof.

Notice that we can think of τV=TrX[τXV]\tau_{V}=\mathrm{Tr}_{X}[\tau_{XV}] as 𝕀V\mathbb{I}_{V}. This is because we can first assume that τV\tau_{V} is full rank (otherwise restrict 𝒱\mathcal{V} to the support of τV\tau_{V}, and the conclusion in this lemma still holds), and if we have the diagonalization QτVQ=𝕀VQ^{\dagger}\tau_{V}Q=\mathbb{I}_{V} for some non-singular QQ, then consider the new system

τXV=(𝕀XQ)τXV(𝕀XQ),\tau^{\prime}_{XV}=(\mathbb{I}_{X}\otimes Q^{\dagger})\tau_{XV}(\mathbb{I}_{X}\otimes Q),

and the set of distributions {PX|vτ}\{P^{\tau}_{X|v}\} and {PX|vτ}\{P^{\tau^{\prime}}_{X|v}\} over |v𝒱|v\rangle\in\mathcal{V} are the same, since PX|vτ=PX|vτP^{\tau^{\prime}}_{X|v}=P^{\tau}_{X|v^{\prime}} for |vQ|v|v^{\prime}\rangle\sim Q|v\rangle. With τV=𝕀V\tau_{V}=\mathbb{I}_{V} we have Tr[τX|v]=1\mathrm{Tr}[\tau_{X|v}]=1 for every |v𝒱|v\rangle\in\mathcal{V}, and thus PX|vτ=diagτX|vP^{\tau}_{X|v}=\operatorname{diag}\tau_{X|v}.

Let 𝒜𝒜\mathcal{A}^{\prime}\subseteq\mathcal{A} be the set of a𝒜a\in\mathcal{A} such that there exists |v𝒱|v\rangle\in\mathcal{V} with |Ma,PX|vτ|2r|\langle M_{a},P^{\tau}_{X|v}\rangle|\geq 2^{-r}. For each a𝒜a\in\mathcal{A}^{\prime}, let

σa=TrX[(DiagMa𝕀V)τXV]\sigma_{a}=\mathrm{Tr}_{X}[(\operatorname{Diag}M_{a}\otimes\mathbb{I}_{V})\tau_{XV}]

which is a Hermitian operator on 𝒱\mathcal{V}. There exists |v𝒱|v\rangle\in\mathcal{V} such that

|v|σa|v|=|Ma,diagτX|v|=|Ma,PX|vτ|2r,|\langle v|\sigma_{a}|v\rangle|=|\langle M_{a},\operatorname{diag}\tau_{X|v}\rangle|=|\langle M_{a},P^{\tau}_{X|v}\rangle|\geq 2^{-r},

which means that σa22r\|\sigma_{a}\|_{2}\geq 2^{-r}. Now let |u|u\rangle be a uniformly random unit vector in 𝒱\mathcal{V}, and by Lemma 3.1 we know that for some absolute constant cc,

Pr|u[|u|σa|u|2r]12(q+rr)/2ce2q12rce11/2.\Pr_{|u\rangle}\left[|\langle u|\sigma_{a}|u\rangle|\geq 2^{-r^{\prime}}\right]\geq 1-2^{(q+r-r^{\prime})/2}c-e^{-2^{q}}\geq 1-2^{-r}c-e^{-1}\geq 1/2.

The second last inequality comes from Eq. 4, while the last inequality is because of the assumption that rr is sufficiently large.

Since the above holds for every a𝒜a\in\mathcal{A}^{\prime}, it implies that Pra𝒜,|u[|u|σa|u|2r]\Pr_{a\in\mathcal{A}^{\prime},|u\rangle}[|\langle u|\sigma_{a}|u\rangle|\geq 2^{-r^{\prime}}] is at least 1/21/2. It means that there exists some |u𝒱|u\rangle\in\mathcal{V} such that |u|σa|u|2r|\langle u|\sigma_{a}|u\rangle|\geq 2^{-r^{\prime}} for at least half of a𝒜a\in\mathcal{A}^{\prime}. On the other hand, since MM is a (k,)(k^{\prime},\ell^{\prime})-extractor with error 2r2^{-r^{\prime}}, and PX|uτ222n/2\big{\|}P^{\tau}_{X|u}\big{\|}_{2}\leq 2^{\ell^{\prime}}\cdot 2^{-n/2}, there are at most 2k2^{-k^{\prime}} fraction of a𝒜a\in\mathcal{A} such that |u|σa|u|=|Ma,PX|uτ|2r|\langle u|\sigma_{a}|u\rangle|=|\langle M_{a},P^{\tau}_{X|u}\rangle|\geq 2^{-r^{\prime}}. That means

PraR𝒜[a𝒜]22k22r.\Pr_{a\in_{R}\mathcal{A}}\left[a\in\mathcal{A}^{\prime}\right]\leq 2\cdot 2^{-k^{\prime}}\leq 2^{-2r}.

Here k12rk^{\prime}-1\geq 2r, by the definition of rr. ∎

Corollary 4.13.

For every 0tT0\leq t\leq T, we have 𝐄aR𝒜τXVW(t,a)τXVW(t,)Tr22r\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\big{\|}\tau^{(t,a)}_{XVW}-\tau^{(t,\infty)}_{XVW}\big{\|}_{\mathrm{Tr}}\leq 2^{-2r}.

Proof.

For each w𝒲w\in\mathcal{W}, the partial system τXV|w(t,)\tau^{(t,\infty)}_{XV|w} satisfies the condition of Lemma 4.12 since for every |v𝒱|v\rangle\in\mathcal{V},

PX|v,wτ(t,)2(12r)122n/222n/2.\big{\|}P^{\tau^{(t,\infty)}}_{X|v,w}\big{\|}_{2}\leq(1-2^{-r})^{-1}\cdot 2^{\ell}\cdot 2^{-n/2}\leq 2^{\ell^{\prime}}\cdot 2^{-n/2}.

Notice that for each a𝒜a\in\mathcal{A} such that there does not exist |v𝒱|v\rangle\in\mathcal{V} with |Ma,PX|v,wτ(t,)|2r|\langle M_{a},P^{\tau^{(t,\infty)}}_{X|v,w}\rangle|\geq 2^{-r} (that is, when Ga(v,w)G_{a}(v,w) holds for every |v𝒱|v\rangle\in\mathcal{V}), the subsystem τXV|w(t,)\tau^{(t,\infty)}_{XV|w} is not touched in the truncation by GaG_{a} and we have τXV|w(t,a)=τXV|w(t,)\tau^{(t,a)}_{XV|w}=\tau^{(t,\infty)}_{XV|w}. Therefore

𝐄aR𝒜τXVW(t,a)τXVW(t,)Tr\displaystyle\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\big{\|}\tau^{(t,a)}_{XVW}-\tau^{(t,\infty)}_{XVW}\big{\|}_{\mathrm{Tr}} =w𝒲𝐄aR𝒜τXV|w(t,a)τXV|w(t,)Tr\displaystyle=\sum_{w\in\mathcal{W}}\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\big{\|}\tau^{(t,a)}_{XV|w}-\tau^{(t,\infty)}_{XV|w}\big{\|}_{\mathrm{Tr}}
w𝒲PraR𝒜[|v𝒱,|Ma,PX|v,wτ(t,)|2r]Tr[τXV|w(t,)]\displaystyle\leq\sum_{w\in\mathcal{W}}\Pr_{a\in_{R}\mathcal{A}}\left[\exists|v\rangle\in\mathcal{V},|\langle M_{a},P^{\tau^{(t,\infty)}}_{X|v,w}\rangle|\geq 2^{-r}\right]\cdot\mathrm{Tr}[\tau^{(t,\infty)}_{XV|w}]
22rw𝒲Tr[τXV|w(t,)]22r.\displaystyle\leq 2^{-2r}\cdot\sum_{w\in\mathcal{W}}\mathrm{Tr}[\tau^{(t,\infty)}_{XV|w}]\leq 2^{-2r}.\qed

4.3.4 Evolution preserves trace distance

Lemma 4.14.

For every 0t<T0\leq t<T, we have τXVW(t+1)ρXVW(t+1)Tr𝐄aR𝒜τXVW(t,a)ρXVW(t)Tr\big{\|}\tau^{(t+1)}_{XVW}-\rho^{(t+1)}_{XVW}\big{\|}_{\mathrm{Tr}}\leq\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\big{\|}\tau^{(t,a)}_{XVW}-\rho^{(t)}_{XVW}\big{\|}_{\mathrm{Tr}}.

Proof.

Recall that

ρXVW(t+1)=𝐄aR𝒜[x𝒳|xx|Φt,a,M(a,x)(ρVW|x(t))],\rho^{(t+1)}_{XVW}=\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\left[\sum_{x\in\mathcal{X}}|x\rangle\langle x|\otimes\Phi_{t,a,M(a,x)}\big{(}\rho^{(t)}_{VW|x}\big{)}\right],
τXVW(t+1)=𝐄aR𝒜[x𝒳|xx|Φt,a,M(a,x)(τVW|x(t,a))].\tau^{(t+1)}_{XVW}=\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\left[\sum_{x\in\mathcal{X}}|x\rangle\langle x|\otimes\Phi_{t,a,M(a,x)}\big{(}\tau^{(t,a)}_{VW|x}\big{)}\right].

Therefore by triangle inequality and contractivity of quantum channels under trace norms,

τXVW(t+1)ρXVW(t+1)Tr\displaystyle\big{\|}\tau^{(t+1)}_{XVW}-\rho^{(t+1)}_{XVW}\big{\|}_{\mathrm{Tr}} 𝐄aR𝒜x𝒳|xx|(Φt,a,M(a,x)(τVW|x(t,a))Φt,a,M(a,x)(ρVW|x(t)))Tr\displaystyle\leq\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\left\|\sum_{x\in\mathcal{X}}|x\rangle\langle x|\otimes\left(\Phi_{t,a,M(a,x)}\big{(}\tau^{(t,a)}_{VW|x}\big{)}-\Phi_{t,a,M(a,x)}\big{(}\rho^{(t)}_{VW|x}\big{)}\right)\right\|_{\mathrm{Tr}}
=𝐄aR𝒜x𝒳Φt,a,M(a,x)(τVW|x(t,a))Φt,a,M(a,x)(ρVW|x(t))Tr\displaystyle=\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\sum_{x\in\mathcal{X}}\left\|\Phi_{t,a,M(a,x)}\big{(}\tau^{(t,a)}_{VW|x}\big{)}-\Phi_{t,a,M(a,x)}\big{(}\rho^{(t)}_{VW|x}\big{)}\right\|_{\mathrm{Tr}}
𝐄aR𝒜x𝒳τVW|x(t,a)ρVW|x(t)Tr\displaystyle\leq\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\sum_{x\in\mathcal{X}}\big{\|}\tau^{(t,a)}_{VW|x}-\rho^{(t)}_{VW|x}\big{\|}_{\mathrm{Tr}}
=𝐄aR𝒜τXVW(t,a)ρXVW(t)Tr.\displaystyle=\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\big{\|}\tau^{(t,a)}_{XVW}-\rho^{(t)}_{XVW}\big{\|}_{\mathrm{Tr}}.\qed

4.4 Proof of Theorem 2

Proof.

First, combining Corollaries 4.6, 4.8, 4.10, 4.13 and 4.14 we have

τXVW(t+1)ρXVW(t+1)TrτXVW(t)ρXVW(t)Tr+82q2r+25r+22r.\big{\|}\tau^{(t+1)}_{XVW}-\rho^{(t+1)}_{XVW}\big{\|}_{\mathrm{Tr}}\leq\big{\|}\tau^{(t)}_{XVW}-\rho^{(t)}_{XVW}\big{\|}_{\mathrm{Tr}}+8\cdot 2^{q-2r}+2^{-5r}+2^{-2r}.

Since τXVW(0)=ρXVW(0)\tau^{(0)}_{XVW}=\rho^{(0)}_{XVW}, by triangle inequality we know that τXVW(T)ρXVW(T)TrT102q2r102qr\big{\|}\tau^{(T)}_{XVW}-\rho^{(T)}_{XVW}\big{\|}_{\mathrm{Tr}}\leq T\cdot 10\cdot 2^{q-2r}\leq 10\cdot 2^{q-r}, and thus

τXVW(T,)ρXVW(T)Tr102qr+82q2r+25r.\big{\|}\tau^{(T,\infty)}_{XVW}-\rho^{(T)}_{XVW}\big{\|}_{\mathrm{Tr}}\leq 10\cdot 2^{q-r}+8\cdot 2^{q-2r}+2^{-5r}.

This bounds the difference between the measurement probabilities of τXVW(T,)\tau^{(T,\infty)}_{XVW} and ρXVW(T)\rho^{(T)}_{XVW} under any measurement, specifically the difference between the success probability of the branching program ρ\rho and the following value on τ\tau:

x𝒳,v{0,1}q,w𝒲x~(v,w)=xx,v,w|τXVW(T,)|x,v,w=v{0,1}q,w𝒲Tr[τX|v,w(T,)]PX|v,wτ(T,)(x~(v,w)).\sum_{\begin{subarray}{c}x\in\mathcal{X},v\in\{0,1\}^{q},w\in\mathcal{W}\\ \widetilde{x}(v,w)=x\end{subarray}}\langle x,v,w|\tau^{(T,\infty)}_{XVW}|x,v,w\rangle=\sum_{v\in\{0,1\}^{q},w\in\mathcal{W}}\mathrm{Tr}[\tau^{(T,\infty)}_{X|v,w}]\cdot P^{\tau^{(T,\infty)}}_{X|v,w}(\widetilde{x}(v,w)).

Since PX|v,wτ(T,)22+9r2n\big{\|}P^{\tau^{(T,\infty)}}_{X|v,w}\big{\|}_{\infty}\leq 2^{2\ell+9r}\cdot 2^{-n} and Tr[τXVW(T,)]1\mathrm{Tr}[\tau^{(T,\infty)}_{XVW}]\leq 1, the above value is at most 22+9r2n2^{2\ell+9r}\cdot 2^{-n}. Therefore the success probability of the branching program ρ\rho is at most (recall that 2+9rnr2\ell+9r-n\leq-r)

102qr+82q2r+25r+22+9r2n=O(2qr).10\cdot 2^{q-r}+8\cdot 2^{q-2r}+2^{-5r}+2^{2\ell+9r}\cdot 2^{-n}=O(2^{q-r}).\qed

5 Proof of Lemma 4.5

The first step towards proving Lemma 4.5 is to analyze how PX|v,wτ(t)P^{\tau^{(t)}}_{X|v,w} evolves according to the rule

τXVW(t+1)=𝐄aR𝒜[x𝒳|xx|Φt,a,M(a,x)(τVW|x(t,a))].\tau^{(t+1)}_{XVW}=\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\left[\sum_{x\in\mathcal{X}}|x\rangle\langle x|\otimes\Phi_{t,a,M(a,x)}\big{(}\tau^{(t,a)}_{VW|x}\big{)}\right].

We introduce the following notations. For every a𝒜a\in\mathcal{A} and b{1,1}b\in\{-1,1\}, let

𝟙a,b=12(1+bMa),\mathbbm{1}_{a,b}=\frac{1}{2}(\vec{1}+b\cdot M_{a}),

which is a 0-11 vector that indicates whether M(a,x)=bM(a,x)=b. Let

τXVW(t,a,b)=(Diag𝟙a,b𝕀VW)τXVW(t,a),\tau^{(t,a,b)}_{XVW}=(\operatorname{Diag}\mathbbm{1}_{a,b}\otimes\mathbb{I}_{VW})\tau^{(t,a)}_{XVW}, (7)

so that we can write

τXVW(t+1)=𝐄aR𝒜[(𝕀XΦt,a,1)(τXVW(t,a,1))+(𝕀XΦt,a,1)(τXVW(t,a,1))].\tau^{(t+1)}_{XVW}=\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\left[(\mathbb{I}_{X}\otimes\Phi_{t,a,1})\big{(}\tau^{(t,a,1)}_{XVW}\big{)}+(\mathbb{I}_{X}\otimes\Phi_{t,a,-1})\big{(}\tau^{(t,a,-1)}_{XVW}\big{)}\right]. (8)
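To see that this agrees with the earlier evolution rule for the truncated system, note the partition identity below: the two indicator vectors sum to the all-ones vector, and 𝟙_{a,b} selects exactly those x with M(a,x)=b, so each x is still processed by Φ_{t,a,M(a,x)} as before.

\mathbbm{1}_{a,1}+\mathbbm{1}_{a,-1}=\vec{1},\qquad\mathbbm{1}_{a,b}(x)=\frac{1}{2}\big{(}1+b\cdot M(a,x)\big{)}=\begin{cases}1&\textrm{if }M(a,x)=b,\\ 0&\textrm{otherwise.}\end{cases}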

Thus 3.3 implies that PX|v,wτ(t+1)P^{\tau^{(t+1)}}_{X|v,w} is a convex combination of PX|v,wτ(t,a,b)P^{\tau^{(t,a,b)}}_{X|v^{\prime},w^{\prime}} for some a,b,wa,b,w^{\prime} and |v|v^{\prime}\rangle.

5.1 Target Distribution and Badness

Before considering the target distribution, let us first establish that the 2\ell_{2}-norms of PX|v,wτ(t)P^{\tau^{(t)}}_{X|v,w} cannot be too large:

Lemma 5.1.

For every 0tT0\leq t\leq T, |v𝒱|v\rangle\in\mathcal{V}, w𝒲w\in\mathcal{W}, we have

PX|v,wτ(t)2422n/2.\big{\|}P^{\tau^{(t)}}_{X|v,w}\big{\|}_{2}\leq 4\cdot 2^{\ell}\cdot 2^{-n/2}.
Proof.

When t=0t=0 the statement is clearly true as PX|v,wτ(0)P^{\tau^{(0)}}_{X|v,w} is always uniform.

Now assume t>0t>0. By Lemma 4.1 and Lemma 4.11 we know that

PX|v,wτ(t1,a)2(12r)122n/2\big{\|}P^{\tau^{(t-1,a)}}_{X|v^{\prime},w^{\prime}}\big{\|}_{2}\leq(1-2^{-r})^{-1}\cdot 2^{\ell}\cdot 2^{-n/2}

for every w𝒲,|v𝒱w^{\prime}\in\mathcal{W},|v^{\prime}\rangle\in\mathcal{V} and a𝒜a\in\mathcal{A}, as τXVW(t1,a)\tau_{XVW}^{(t-1,a)} is truncated from τXVW(t1,)\tau_{XVW}^{(t-1,\infty)}. Since Ga(PX|v,wτ(t1,a))G_{a}(P^{\tau^{(t-1,a)}}_{X|v^{\prime},w^{\prime}}) is true, meaning that the distribution is evenly divided by aa, we further have

PX|v,wτ(t1,a,b)2=𝟙a,bPX|v,wτ(t1,a)2𝟙a,bPX|v,wτ(t1,a)12(12r)1PX|v,wτ(t1,a)2422n/2.\big{\|}P^{\tau^{(t-1,a,b)}}_{X|v^{\prime},w^{\prime}}\big{\|}_{2}=\frac{\big{\|}\mathbbm{1}_{a,b}\cdot P^{\tau^{(t-1,a)}}_{X|v^{\prime},w^{\prime}}\big{\|}_{2}}{\big{\|}\mathbbm{1}_{a,b}\cdot P^{\tau^{(t-1,a)}}_{X|v^{\prime},w^{\prime}}\big{\|}_{1}}\leq 2(1-2^{-r})^{-1}\cdot\big{\|}P^{\tau^{(t-1,a)}}_{X|v^{\prime},w^{\prime}}\big{\|}_{2}\leq 4\cdot 2^{\ell}\cdot 2^{-n/2}.

Since PX|v,wτ(t)P^{\tau^{(t)}}_{X|v,w} is a convex combination of PX|v,wτ(t1,a,b)P^{\tau^{(t-1,a,b)}}_{X|v^{\prime},w^{\prime}}, by convexity its 2\ell_{2}-norm is bounded by 422n/24\cdot 2^{\ell}\cdot 2^{-n/2}. ∎
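For completeness, the two inequalities in the last display of the proof above can be justified as follows, assuming r ≥ 2: the even-division property gives ‖𝟙_{a,b}·P^{τ(t−1,a)}_{X|v′,w′}‖_1 = ½(1 + b·⟨M_a, P^{τ(t−1,a)}_{X|v′,w′}⟩) ≥ ½(1 − 2^{−r}), every entry of 𝟙_{a,b}·P is at most the corresponding entry of P so the ℓ2-norm does not increase, and

2(1-2^{-r})^{-1}\cdot(1-2^{-r})^{-1}\cdot 2^{\ell}\cdot 2^{-n/2}=2(1-2^{-r})^{-2}\cdot 2^{\ell}\cdot 2^{-n/2}\leq 4\cdot 2^{\ell}\cdot 2^{-n/2},

since (1 − 2^{−r})² ≥ 1/2 whenever r ≥ 2.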

From now on we use PP to denote a fixed target distribution (which we will later choose to be the distribution in Lemma 4.5), such that

22n/2P2422n/2.2^{\ell}\cdot 2^{-n/2}\leq\|P\|_{2}\leq 4\cdot 2^{\ell}\cdot 2^{-n/2}.

We want to bound the progress of PX|v,wτ(t),P\langle P^{\tau^{(t)}}_{X|v,w},P\rangle, which starts off as 2n2^{-n} at t=0t=0, and becomes at least 222n2^{2\ell}\cdot 2^{-n} when PX|v,wτ(t)=PP^{\tau^{(t)}}_{X|v,w}=P. Note that by Cauchy-Schwarz we always have

PX|v,wτ(t),PPX|v,wτ(t)2P216222n.\langle P^{\tau^{(t)}}_{X|v,w},P\rangle\leq\big{\|}P^{\tau^{(t)}}_{X|v,w}\big{\|}_{2}\big{\|}P\big{\|}_{2}\leq 16\cdot 2^{2\ell}\cdot 2^{-n}. (9)

In order to bound the progress, we introduce some new notations. For any superscript (such as (t,a)(t,a)) on the partial systems, we use σXVW\sigma_{XVW} to denote τXVW(DiagP𝕀VW)\tau_{XVW}(\operatorname{Diag}P\otimes\mathbb{I}_{VW}). Notice that

Tr[σX|v,w]=Tr[τX|v,wDiagP]=Tr[τX|v,w]PX|v,wτ,P.\mathrm{Tr}[\sigma_{X|v,w}]=\mathrm{Tr}[\tau_{X|v,w}\operatorname{Diag}P]=\mathrm{Tr}[\tau_{X|v,w}]\cdot\langle P^{\tau}_{X|v,w},P\rangle.

Similarly, PX|v,wσP^{\sigma}_{X|v,w} can be deduced from PX|v,wτP^{\tau}_{X|v,w} via

PX|v,wσ(x)=Tr[τX|v,w]Tr[σX|v,w]PX|v,wτ(x)P(x)=PX|v,wτ(x)P(x)PX|v,wτ,P.P^{\sigma}_{X|v,w}(x)=\frac{\mathrm{Tr}[\tau_{X|v,w}]}{\mathrm{Tr}[\sigma_{X|v,w}]}\cdot P^{\tau}_{X|v,w}(x)\cdot P(x)=\frac{P^{\tau}_{X|v,w}(x)\cdot P(x)}{\langle P^{\tau}_{X|v,w},P\rangle}. (10)

Therefore we can bound the 2\ell_{2} norm of PX|v,wσP^{\sigma}_{X|v,w} as

PX|v,wσ21PX|v,wτ,PPX|v,wτP2.\big{\|}P^{\sigma}_{X|v,w}\big{\|}_{2}\leq\frac{1}{\langle P^{\tau}_{X|v,w},P\rangle}\cdot\big{\|}P^{\tau}_{X|v,w}\big{\|}_{\infty}\cdot\big{\|}P\big{\|}_{2}.

Now we can identify the places where PX|v,wτ(t),P\langle P^{\tau^{(t)}}_{X|v,w},P\rangle increases by a lot, which happens when the inner product is not evenly divided by some a𝒜a\in\mathcal{A} (we will see the reason in the analysis later). Formally, at stage 0t<T0\leq t<T, we say (w,a)(w,a) is bad if

|v𝒱, s.t. |Ma,PX|v,wσ(t,a)|>2r and PX|v,wτ(t,a),P122n.\exists|v\rangle\in\mathcal{V}\textrm{, s.t. }|\langle M_{a},P^{\sigma^{(t,a)}}_{X|v,w}\rangle|>2^{-r}\textrm{ and }\langle P^{\tau^{(t,a)}}_{X|v,w},P\rangle\geq\frac{1}{2}\cdot 2^{-n}. (11)
Lemma 5.2.

For every 0t<T0\leq t<T and w𝒲w\in\mathcal{W}, we have

PraR𝒜[(w,a) is bad]2k.\Pr_{a\in_{R}\mathcal{A}}[\textrm{$(w,a)$ is bad}]\leq 2^{-k}.
Proof.

Since τXVW(t,a)\tau^{(t,a)}_{XVW} is truncated from τXVW(t,)\tau^{(t,\infty)}_{XVW}, Lemma 4.1 shows that for every |v𝒱|v\rangle\in\mathcal{V}, w𝒲w\in\mathcal{W} and a𝒜a\in\mathcal{A} there is |v𝒱|v^{\prime}\rangle\in\mathcal{V} such that

PX|v,wτ(t,a)=PX|v,wτ(t,)P^{\tau^{(t,a)}}_{X|v,w}=P^{\tau^{(t,\infty)}}_{X|v^{\prime},w}

and by Eq. 10 it also implies that

PX|v,wσ(t,a)=PX|v,wσ(t,).P^{\sigma^{(t,a)}}_{X|v,w}=P^{\sigma^{(t,\infty)}}_{X|v^{\prime},w}.

Now fix some w𝒲w\in\mathcal{W}, and let 𝒜𝒜\mathcal{A}^{\prime}\subseteq\mathcal{A} be the set of a𝒜a\in\mathcal{A} such that

|v𝒱, s.t. |Ma,PX|v,wσ(t,)|>2r and PX|v,wτ(t,),P122n.\exists|v\rangle\in\mathcal{V}\textrm{, s.t. }|\langle M_{a},P^{\sigma^{(t,\infty)}}_{X|v,w}\rangle|>2^{-r}\textrm{ and }\langle P^{\tau^{(t,\infty)}}_{X|v,w},P\rangle\geq\frac{1}{2}\cdot 2^{-n}.

Then 𝒜\mathcal{A}^{\prime} contains all aa such that (w,a)(w,a) is bad, and our goal is to bound the fraction of 𝒜\mathcal{A}^{\prime} in 𝒜\mathcal{A}.

In the rest of the proof we temporarily omit the superscript and write τ(t,)\tau^{(t,\infty)} and σ(t,)\sigma^{(t,\infty)} simply as τ\tau and σ\sigma. For the same reason as in Lemma 4.12 we can assume that τV|w=𝕀V\tau_{V|w}=\mathbb{I}_{V}, and thus

v|σV|w|v=Tr[σX|v,w]=PX|v,wτ,P, and Tr[σXV|w]=PX|wτ,P16222n.\langle v|\sigma_{V|w}|v\rangle=\mathrm{Tr}[\sigma_{X|v,w}]=\langle P^{\tau}_{X|v,w},P\rangle,{\textrm{ and }\mathrm{Tr}[\sigma_{XV|w}]=\langle P^{\tau}_{X|w},P\rangle\leq 16\cdot 2^{2\ell}\cdot 2^{-n}.}

where the last inequality is by Lemma 4.11 and Cauchy-Schwarz, in the same way as Eq. 9.

Suppose that we have diagonalization σV|w=UDU\sigma_{V|w}=U^{\dagger}DU, where UU is unitary and DD is diagonal and non-negative. Let 𝒱𝒱\mathcal{V}^{\prime}\subseteq\mathcal{V} be the subspace spanned by U|eU^{\dagger}|e\rangle over the computational basis vectors |e𝒱|e\rangle\in\mathcal{V} such that e|D|e24r222n\langle e|D|e\rangle\geq 2^{-4r}\cdot{2^{-2\ell}}\cdot 2^{-n}. So for every |v𝒱|v\rangle\in\mathcal{V}^{\prime} we have

PX|v,wτ,P=Tr[σX|v,w]24r222n.\langle P^{\tau}_{X|v,w},P\rangle=\mathrm{Tr}[\sigma_{X|v,w}]\geq 2^{-4r}\cdot{2^{-2\ell}}\cdot 2^{-n}.

We claim that for every a𝒜a\in\mathcal{A}^{\prime}, there exists |v𝒱|v\rangle\in\mathcal{V}^{\prime} such that |Ma,PX|v,wσ|>122r|\langle M_{a},P^{\sigma}_{X|v,w}\rangle|>\frac{1}{2}\cdot 2^{-r}. To prove the claim, let Π\Pi be the projection operator from 𝒱\mathcal{V} to 𝒱\mathcal{V}^{\prime}, and then (𝕀XΠ)σXV|w(𝕀XΠ)(\mathbb{I}_{X}\otimes\Pi)\sigma_{XV|w}(\mathbb{I}_{X}\otimes\Pi) can be conceptually seen as a truncated partial system σXV|w|G\sigma^{|G}_{XV|w} where G(v,w)G(v,w) holds when Tr[σX|v,w]24r22n\mathrm{Tr}[\sigma_{X|v,w}]\geq 2^{-4r-2\ell}\cdot 2^{-n} for the fixed ww. By Lemma 4.3 we have

σXV|w|GσXV|wTr32q24r2nTr[σXV|w]122q2r2n.\big{\|}\sigma^{|G}_{XV|w}-\sigma_{XV|w}\big{\|}_{\mathrm{Tr}}\leq 3\cdot 2^{q}\cdot{\sqrt{2^{-4r-2\ell-n}\cdot\mathrm{Tr}[\sigma_{XV|w}]}}\leq 12\cdot 2^{q-2r}\cdot 2^{-n}.
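Here the last inequality plugs in the bound Tr[σ_{XV|w}] ≤ 16·2^{2ℓ}·2^{−n} from above:

3\cdot 2^{q}\cdot\sqrt{2^{-4r-2\ell-n}\cdot\mathrm{Tr}[\sigma_{XV|w}]}\leq 3\cdot 2^{q}\cdot\sqrt{2^{-4r-2\ell-n}\cdot 16\cdot 2^{2\ell}\cdot 2^{-n}}=12\cdot 2^{q-2r}\cdot 2^{-n}.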

Since a𝒜a\in\mathcal{A}^{\prime}, assume for |u𝒱|u\rangle\in\mathcal{V} we have |Ma,PX|u,wσ|>2r|\langle M_{a},P^{\sigma}_{X|u,w}\rangle|>2^{-r} and Tr[σX|u,w]=PX|u,wτ,P122n\mathrm{Tr}[\sigma_{X|u,w}]=\langle P^{\tau}_{X|u,w},P\rangle\geq\frac{1}{2}\cdot 2^{-n}. Let |vΠ|u|v\rangle\sim\Pi|u\rangle, then we have

PX|u,wσPX|v,wσ1=\displaystyle\big{\|}P^{\sigma}_{X|u,w}-P^{\sigma}_{X|v,w}\big{\|}_{1}= PX|u,wσPX|u,wσ|G1σX|u,wTr[σX|u,w]σX|u,w|GTr[σX|u,w|G]Tr\displaystyle\ \big{\|}P^{\sigma}_{X|u,w}-P^{\sigma^{|G}}_{X|u,w}\big{\|}_{1}\leq\left\|\frac{\sigma_{X|u,w}}{\mathrm{Tr}[\sigma_{X|u,w}]}-\frac{\sigma^{|G}_{X|u,w}}{\mathrm{Tr}[\sigma^{|G}_{X|u,w}]}\right\|_{\mathrm{Tr}}
\displaystyle\leq σX|u,wTr[σX|u,w]σX|u,w|GTr[σX|u,w]Tr+σX|u,w|GTr[σX|u,w]σX|u,w|GTr[σX|u,w|G]Tr\displaystyle\ \left\|{\frac{\sigma_{X|u,w}}{\mathrm{Tr}[\sigma_{X|u,w}]}-\frac{\sigma^{|G}_{X|u,w}}{\mathrm{Tr}[\sigma_{X|u,w}]}}\right\|_{\mathrm{Tr}}+\left\|{\frac{\sigma^{|G}_{X|u,w}}{\mathrm{Tr}[\sigma_{X|u,w}]}-\frac{\sigma^{|G}_{X|u,w}}{\mathrm{Tr}[\sigma^{|G}_{X|u,w}]}}\right\|_{\mathrm{Tr}}
=\displaystyle= σX|u,wTr[σX|u,w]σX|u,w|GTr[σX|u,w]Tr+|1Tr[σX|u,w]1Tr[σX|u,w|G]|Tr[σX|u,w|G]\displaystyle\ \left\|{\frac{\sigma_{X|u,w}}{\mathrm{Tr}[\sigma_{X|u,w}]}-\frac{\sigma^{|G}_{X|u,w}}{\mathrm{Tr}[\sigma_{X|u,w}]}}\right\|_{\mathrm{Tr}}+\left|\frac{1}{\mathrm{Tr}[\sigma_{X|u,w}]}-\frac{1}{\mathrm{Tr}[\sigma^{|G}_{X|u,w}]}\right|\cdot\mathrm{Tr}[\sigma^{|G}_{X|u,w}]
=\displaystyle= σX|u,wσX|u,w|GTrTr[σX|u,w]+|Tr[σX|u,w|G]Tr[σX|u,w]|Tr[σX|u,w]\displaystyle\frac{\big{\|}\sigma_{X|u,w}-\sigma^{|G}_{X|u,w}\big{\|}_{\mathrm{Tr}}}{\mathrm{Tr}[\sigma_{X|u,w}]}+\frac{\left|\mathrm{Tr}[\sigma^{|G}_{X|u,w}]-\mathrm{Tr}[\sigma_{X|u,w}]\right|}{\mathrm{Tr}[\sigma_{X|u,w}]}
\displaystyle\leq 2σX|u,wσX|u,w|GTrTr[σX|u,w]2σXV|wσXV|w|GTrTr[σX|u,w]482q2r122r,\displaystyle\ \frac{2\big{\|}\sigma_{X|u,w}-\sigma^{|G}_{X|u,w}\big{\|}_{\mathrm{Tr}}}{\mathrm{Tr}[\sigma_{X|u,w}]}\leq\frac{2\big{\|}\sigma_{XV|w}-\sigma^{|G}_{XV|w}\big{\|}_{\mathrm{Tr}}}{\mathrm{Tr}[\sigma_{X|u,w}]}\leq 48\cdot 2^{q-2r}\leq\frac{1}{2}\cdot 2^{-r},

where the last step is due to qr7q\leq r-7. Thus

|Ma,PX|v,wσ||Ma,PX|u,wσ|PX|u,wσPX|v,wσ1>122r.|\langle M_{a},P^{\sigma}_{X|v,w}\rangle|\geq|\langle M_{a},P^{\sigma}_{X|u,w}\rangle|-\big{\|}P^{\sigma}_{X|u,w}-P^{\sigma}_{X|v,w}\big{\|}_{1}>\frac{1}{2}\cdot 2^{-r}.

Similarly to the proof for Lemma 4.12, for each a𝒜a\in\mathcal{A}^{\prime} let

πa=TrX[(DiagMaUD1/2U)σXV|w(𝕀XUD1/2U)]\pi_{a}=\mathrm{Tr}_{X}[(\operatorname{Diag}M_{a}\otimes U^{\dagger}D^{-1/2}U)\cdot\sigma_{XV|w}\cdot(\mathbb{I}_{X}\otimes U^{\dagger}D^{-1/2}U)]

which is a Hermitian operator on 𝒱\mathcal{V}. For each |v𝒱|v\rangle\in\mathcal{V}, let |vUD1/2U|v|v^{\prime}\rangle\sim U^{\dagger}D^{1/2}U|v\rangle. Recall that σV|w=UDU\sigma_{V|w}=U^{\dagger}DU, and therefore

PX|v,wσ\displaystyle P^{\sigma}_{X|v,w} =diag(𝕀Xv|)σXV|w(𝕀X|v)v|σV|w|v\displaystyle=\frac{\operatorname{diag}\ (\mathbb{I}_{X}\otimes\langle v|)\sigma_{XV|w}(\mathbb{I}_{X}\otimes|v\rangle)}{\langle v|\sigma_{V|w}|v\rangle}
=diag(𝕀Xv|UD1/2U)σXV|w(𝕀XUD1/2U|v)v|UD1/2UσV|wUD1/2U|v\displaystyle=\frac{\operatorname{diag}\ (\mathbb{I}_{X}\otimes\langle v^{\prime}|U^{\dagger}D^{-1/2}U)\sigma_{XV|w}(\mathbb{I}_{X}\otimes U^{\dagger}D^{-1/2}U|v^{\prime}\rangle)}{\langle v^{\prime}|U^{\dagger}D^{-1/2}U\sigma_{V|w}U^{\dagger}D^{-1/2}U|v^{\prime}\rangle}
=diag(𝕀Xv|UD1/2U)σXV|w(𝕀XUD1/2U|v).\displaystyle=\operatorname{diag}\ (\mathbb{I}_{X}\otimes\langle v^{\prime}|U^{\dagger}D^{-1/2}U)\sigma_{XV|w}(\mathbb{I}_{X}\otimes U^{\dagger}D^{-1/2}U|v^{\prime}\rangle).

And that means

v|πa|v=Ma,diag(𝕀Xv|UD1/2U)σXV|w(𝕀XUD1/2U|v)=Ma,PX|v,wσ.\langle v^{\prime}|\pi_{a}|v^{\prime}\rangle=\left\langle M_{a},\operatorname{diag}\ (\mathbb{I}_{X}\otimes\langle v^{\prime}|U^{\dagger}D^{-1/2}U)\sigma_{XV|w}(\mathbb{I}_{X}\otimes U^{\dagger}D^{-1/2}U|v^{\prime}\rangle)\right\rangle=\langle M_{a},P^{\sigma}_{X|v,w}\rangle.

We showed above that there exists |v𝒱|v\rangle\in\mathcal{V}^{\prime}, and thus |v𝒱|v^{\prime}\rangle\in\mathcal{V}^{\prime} such that

|v|πa|v|=|Ma,PX|v,wσ|122r,|\langle v^{\prime}|\pi_{a}|v^{\prime}\rangle|=\left|\langle M_{a},P^{\sigma}_{X|v,w}\rangle\right|\geq\frac{1}{2}\cdot 2^{-r},

which means that for ΠπaΠ\Pi\pi_{a}\Pi, the restriction of πa\pi_{a} on 𝒱\mathcal{V}^{\prime}, we have ΠπaΠ2122r\|\Pi\pi_{a}\Pi\|_{2}\geq\frac{1}{2}\cdot 2^{-r}. Now consider a uniformly random unit vector |v|v^{\prime}\rangle in 𝒱\mathcal{V}^{\prime}, and by Lemma 3.1 we know that for some absolute constant cc,

Pr|v[|v|πa|v|2r]12(q+r+1r)/2ce2q12rce112.\Pr_{|v^{\prime}\rangle}\left[|\langle v^{\prime}|\pi_{a}|v^{\prime}\rangle|\geq 2^{-r^{\prime}}\right]\geq 1-2^{(q+r+1-r^{\prime})/2}c-e^{-2^{q}}\geq 1-2^{-r}c-e^{-1}\geq\frac{1}{2}.

Therefore, for the random vector |vUD1/2U|v|v\rangle\sim U^{\dagger}D^{-1/2}U|v^{\prime}\rangle where |v|v^{\prime}\rangle is uniform in 𝒱\mathcal{V}^{\prime}, we conclude that

Pr|v[|Ma,PX|v,wσ|2r]12.\Pr_{|v\rangle}\left[|\langle M_{a},P^{\sigma}_{X|v,w}\rangle|\geq 2^{-r^{\prime}}\right]\geq\frac{1}{2}.

On the other hand, as |v𝒱|v^{\prime}\rangle\in\mathcal{V}^{\prime}, it also holds that |v𝒱|v\rangle\in\mathcal{V}^{\prime}, therefore PX|v,wτ,P24r222n\langle P^{\tau}_{X|v,w},P\rangle\geq 2^{-4r}\cdot 2^{-2\ell}\cdot 2^{-n} is always true. Thus there exists a |v𝒱|v\rangle\in\mathcal{V} that simultaneously satisfies

PX|v,wτ,P24r222nand|Ma,PX|v,wσ|2r\langle P^{\tau}_{X|v,w},P\rangle\geq 2^{-4r}\cdot 2^{-2\ell}\cdot 2^{-n}\quad\textrm{and}\quad|\langle M_{a},P^{\sigma}_{X|v,w}\rangle|\geq 2^{-r^{\prime}}

for at least 1/21/2 of a𝒜a\in\mathcal{A}^{\prime}. Since

PX|v,wσ21PX|v,wτ,PPX|v,wτP2425+13r2n/2=22n/2,\big{\|}P^{\sigma}_{X|v,w}\big{\|}_{2}\leq\frac{1}{\langle P^{\tau}_{X|v,w},P\rangle}\cdot\big{\|}P^{\tau}_{X|v,w}\big{\|}_{\infty}\cdot\big{\|}P\big{\|}_{2}\leq 4\cdot 2^{{5\ell+13r}}\cdot 2^{-n/2}=2^{\ell^{\prime}}\cdot 2^{-n/2},

and MM is a (k,)(k^{\prime},\ell^{\prime})-extractor with error 2r2^{-r^{\prime}}, there are at most 2k2^{-k^{\prime}} fraction of a𝒜a\in\mathcal{A} such that |Ma,PX|v,wσ|2r|\langle M_{a},P^{\sigma}_{X|v,w}\rangle|\geq 2^{-r^{\prime}}, which means that

PraR𝒜[(w,a) is bad]PraR𝒜[a𝒜]22k=2k.\Pr_{a\in_{R}\mathcal{A}}[\textrm{$(w,a)$ is bad}]\leq\Pr_{a\in_{R}\mathcal{A}}[a\in\mathcal{A}^{\prime}]\leq 2\cdot 2^{-k^{\prime}}=2^{-k}.\qed
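For reference, the exponent arithmetic in the norm bound on P^σ_{X|v,w} above can be tracked as follows, using ⟨P^τ_{X|v,w},P⟩ ≥ 2^{−4r−2ℓ}·2^{−n}, the G∞ bound ‖P^τ_{X|v,w}‖_∞ ≤ 2^{2ℓ+9r}·2^{−n}, ‖P‖_2 ≤ 4·2^ℓ·2^{−n/2}, and the relation ℓ′ = 5ℓ+13r+2 recalled in Appendix A:

\frac{1}{\langle P^{\tau}_{X|v,w},P\rangle}\cdot\big\|P^{\tau}_{X|v,w}\big\|_{\infty}\cdot\big\|P\big\|_{2}\leq 2^{4r+2\ell+n}\cdot 2^{2\ell+9r}\cdot 2^{-n}\cdot 4\cdot 2^{\ell}\cdot 2^{-n/2}=2^{5\ell+13r+2}\cdot 2^{-n/2}=2^{\ell^{\prime}}\cdot 2^{-n/2}.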

5.2 Badness Levels

At stage tt, for each classical memory state w𝒲w\in\mathcal{W} we count how many times the path to it has been bad, which is a random variable depending on the previous random choices of a𝒜a\in\mathcal{A}. This is stored in another classical register BB, which we call badness level and takes values β{0,,T}\beta\in\{0,\ldots,T\}. It is initially set to be 0, that is, we let

τXVWB(0)=τXVW(0)|00|B.\tau^{(0)}_{XVWB}=\tau^{(0)}_{XVW}\otimes|0\rangle\langle 0|_{B}.

We ensure that the distribution of BB always only depends on WW and is independent of XX and VV conditioned on WW, using the following updating rules on the combined system τXVWB\tau_{XVWB} for each stage 0t<T0\leq t<T:

  • The truncation steps are executed independently of BB. Therefore, for each a𝒜a\in\mathcal{A} we let

    τXVWB(t,a)=w𝒲τXV|w(t,a)|ww|DiagPB|wτ(t).\tau^{(t,a)}_{XVWB}=\sum_{w\in\mathcal{W}}\tau^{(t,a)}_{XV|w}\otimes|w\rangle\langle w|\otimes\operatorname{Diag}P^{\tau^{(t)}}_{B|w}. (12)
  • The value of BB updates before the evolution step, where for each a𝒜a\in\mathcal{A} and b{1,1}b\in\{-1,1\} we let

    τXVWB(t,a,b)=(Diag𝟙a,b𝕀VUa)τXVWB(t,a)(𝕀XVUa).\tau^{(t,a,b)}_{XVWB}=(\operatorname{Diag}\mathbbm{1}_{a,b}\otimes\mathbb{I}_{V}\otimes U_{a})\tau^{(t,a)}_{XVWB}(\mathbb{I}_{XV}\otimes U_{a}^{\dagger}).

    Here UaU_{a} is a permutation operator, depending on τXVW(t,a)\tau^{(t,a)}_{XVW}, acting on 𝒲{0,,T}\mathcal{W}\otimes\{0,\ldots,T\} such that

    Ua|w|β={|w|(β+1)mod(T+1) if (w,a) is bad,|w|β otherwise.U_{a}|w\rangle|\beta\rangle=\left\{\begin{array}[]{ll}|w\rangle|(\beta+1)\mathop{\mathrm{mod}}(T+1)\rangle&\textrm{ if $(w,a)$ is bad,}\\ |w\rangle|\beta\rangle&\textrm{ otherwise.}\end{array}\right.
  • For the evolution step, we apply the channels Φt,a,b\Phi_{t,a,b} on the memories WW and VV to get

    τXVWB(t+1)=𝐄aR𝒜[(𝕀XΦt,a,1𝕀B)(τXVWB(t,a,1))+(𝕀XΦt,a,1𝕀B)(τXVWB(t,a,1))].\tau^{(t+1)}_{XVWB}=\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\left[(\mathbb{I}_{X}\otimes\Phi_{t,a,1}\otimes\mathbb{I}_{B})\big{(}\tau^{(t,a,1)}_{XVWB}\big{)}+(\mathbb{I}_{X}\otimes\Phi_{t,a,-1}\otimes\mathbb{I}_{B})\big{(}\tau^{(t,a,-1)}_{XVWB}\big{)}\right].

Notice that the evolution step might introduce dependencies between X,VX,V and BB. However, such dependencies are eliminated later due to how we handle the truncation steps (12), and thus do not affect our proof.

We can check that the combined partial system τXVWB(t)\tau^{(t)}_{XVWB} defined above is consistent with the partial system τXVW(t)\tau^{(t)}_{XVW} that we discussed in previous sections, in the sense that TrB[τXVWB(t)]=τXVW(t)\mathrm{Tr}_{B}[\tau^{(t)}_{XVWB}]=\tau^{(t)}_{XVW} always holds:

  • For the truncation step, it is straightforward to check that

    TrB[τXVWB(t,a)]=w𝒲τXV|w(t,a)|ww|=τXVW(t,a).\mathrm{Tr}_{B}[\tau^{(t,a)}_{XVWB}]=\sum_{w\in\mathcal{W}}\tau^{(t,a)}_{XV|w}\otimes|w\rangle\langle w|=\tau^{(t,a)}_{XVW}.
  • The permutation operator UaU_{a} acts on 𝒲\mathcal{W} as identity since

    TrB[Ua|w,βw,β|Ua]=|ww|.\mathrm{Tr}_{B}\left[U_{a}|w,\beta\rangle\langle w,\beta|U_{a}^{\dagger}\right]=|w\rangle\langle w|.

    Recalling Eq. 7 that τXVW(t,a,b)=(Diag𝟙a,b𝕀VW)τXVW(t,a)\tau^{(t,a,b)}_{XVW}=(\operatorname{Diag}\mathbbm{1}_{a,b}\otimes\mathbb{I}_{VW})\tau^{(t,a)}_{XVW}, we have TrB[τXVWB(t,a,b)]=τXVW(t,a,b)\mathrm{Tr}_{B}[\tau^{(t,a,b)}_{XVWB}]=\tau^{(t,a,b)}_{XVW}.

  • The evolution step can be checked directly from the formula without BB (Eq. 8):

    τXVW(t+1)=𝐄aR𝒜[(𝕀XΦt,a,1)(τXVW(t,a,1))+(𝕀XΦt,a,1)(τXVW(t,a,1))].\tau^{(t+1)}_{XVW}=\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\left[(\mathbb{I}_{X}\otimes\Phi_{t,a,1})\big{(}\tau^{(t,a,1)}_{XVW}\big{)}+(\mathbb{I}_{X}\otimes\Phi_{t,a,-1})\big{(}\tau^{(t,a,-1)}_{XVW}\big{)}\right].

So all previously proved properties about τXVW(t)\tau^{(t)}_{XVW} are preserved. In addition, we prove the following two properties about badness levels.

Lemma 5.3.

For every 0tT0\leq t\leq T, |v𝒱|v\rangle\in\mathcal{V} and w𝒲w\in\mathcal{W}, we have

PX|v,wτ(t),Pβ=0TPB|wτ(t)(β)2β2n(12r)3t.\langle P^{\tau^{(t)}}_{X|v,w},P\rangle\leq\sum_{\beta=0}^{T}P^{\tau^{(t)}}_{B|w}(\beta)\cdot 2^{\beta}\cdot 2^{-n}\cdot(1-2^{-r})^{-3t}.
Proof.

We prove it by induction on tt. For t=0t=0 the lemma is true as PX|v,wτ(t),P=2n\langle P^{\tau^{(t)}}_{X|v,w},P\rangle=2^{-n} and PB|wτ(t)(0)=1P^{\tau^{(t)}}_{B|w}(0)=1.

Suppose the lemma holds true for some t<Tt<T. By a similar argument as in Lemma 4.11 and applying Lemma 4.1 multiple times, we know that for every |v𝒱,w𝒲|v\rangle\in\mathcal{V},w\in\mathcal{W} and a𝒜a\in\mathcal{A}, there exist |v|v^{\prime}\rangle and |v′′𝒱|v^{\prime\prime}\rangle\in\mathcal{V} such that

PX|v,wτ(t,a),P=PX|v,wτ(t,),P(12r)1PX|v,wτ(t,),P=(12r)1PX|v′′,wτ(t),P,\langle P^{\tau^{(t,a)}}_{X|v,w},P\rangle=\langle P^{\tau^{(t,\circ)}}_{X|v^{\prime},w},P\rangle\leq(1-2^{-r})^{-1}\langle P^{\tau^{(t,\star)}}_{X|v^{\prime},w},P\rangle=(1-2^{-r})^{-1}\langle P^{\tau^{(t)}}_{X|v^{\prime\prime},w},P\rangle,

and therefore

PX|v,wτ(t,a),Pβ=0TPB|wτ(t)(β)2β2n(12r)3t1.\langle P^{\tau^{(t,a)}}_{X|v,w},P\rangle\leq\sum_{\beta=0}^{T}P^{\tau^{(t)}}_{B|w}(\beta)\cdot 2^{\beta}\cdot 2^{-n}\cdot(1-2^{-r})^{-3t-1}. (13)

Also, the truncation step by GaG_{a} implies that |Ma,PX|v,wτ(t,a)|2r|\langle M_{a},P^{\tau^{(t,a)}}_{X|v,w}\rangle|\leq 2^{-r}. That is, for both b{1,1}b\in\{-1,1\},

12r2𝟙a,bPX|v,wτ(t,a)11+2r.1-2^{-r}\leq 2\big{\|}\mathbbm{1}_{a,b}\cdot P^{\tau^{(t,a)}}_{X|v,w}\big{\|}_{1}\leq 1+2^{-r}.

Therefore we have, unconditionally

PX|v,wτ(t,a,b),P=𝟙a,bPX|v,wτ(t,a),P𝟙a,bPX|v,wτ(t,a)12(12r)1PX|v,wτ(t,a),P.\langle P^{\tau^{(t,a,b)}}_{X|v,w},P\rangle=\frac{\langle\mathbbm{1}_{a,b}\cdot P^{\tau^{(t,a)}}_{X|v,w},P\rangle}{\big{\|}\mathbbm{1}_{a,b}\cdot P^{\tau^{(t,a)}}_{X|v,w}\big{\|}_{1}}\leq 2(1-2^{-r})^{-1}\cdot\langle P^{\tau^{(t,a)}}_{X|v,w},P\rangle. (14)

When the inner product is evenly divided, i.e. |Ma,PX|v,wσ(t,a)|2r|\langle M_{a},P^{\sigma^{(t,a)}}_{X|v,w}\rangle|\leq 2^{-r}, we further have

𝟙a,bPX|v,wτ(t,a),P12(1+2r)PX|v,wτ(t,a),P12(12r)1PX|v,wτ(t,a),P,\langle\mathbbm{1}_{a,b}\cdot P^{\tau^{(t,a)}}_{X|v,w},P\rangle\leq\frac{1}{2}(1+2^{-r})\langle P^{\tau^{(t,a)}}_{X|v,w},P\rangle\leq\frac{1}{2}(1-2^{-r})^{-1}\langle P^{\tau^{(t,a)}}_{X|v,w},P\rangle,

which means that

PX|v,wτ(t,a,b),P=𝟙a,bPX|v,wτ(t,a),P𝟙a,bPX|v,wτ(t,a)1(12r)2PX|v,wτ(t,a),P.\langle P^{\tau^{(t,a,b)}}_{X|v,w},P\rangle=\frac{\langle\mathbbm{1}_{a,b}\cdot P^{\tau^{(t,a)}}_{X|v,w},P\rangle}{\big{\|}\mathbbm{1}_{a,b}\cdot P^{\tau^{(t,a)}}_{X|v,w}\big{\|}_{1}}\leq(1-2^{-r})^{-2}\cdot\langle P^{\tau^{(t,a)}}_{X|v,w},P\rangle. (15)

Now there are three cases to discuss:

  • If (w,a)(w,a) is bad, we have PB|wτ(t,a,b)(β)=PB|wτ(t)(β1)P^{\tau^{(t,a,b)}}_{B|w}(\beta)=P^{\tau^{(t)}}_{B|w}(\beta-1) for every β>0\beta>0. Notice that PB|wτ(t)(T)=0P^{\tau^{(t)}}_{B|w}(T)=0 as t<Tt<T, and thus Eq. 13 and Eq. 14 imply that

    PX|v,wτ(t,a,b),P\displaystyle\langle P^{\tau^{(t,a,b)}}_{X|v,w},P\rangle β=0T1PB|wτ(t)(β)2β+12n(12r)3t2\displaystyle\leq\sum_{\beta=0}^{T-1}P^{\tau^{(t)}}_{B|w}(\beta)\cdot 2^{\beta+1}\cdot 2^{-n}\cdot(1-2^{-r})^{-3t-2}
    β=0TPB|wτ(t,a,b)(β)2β2n(12r)3(t+1).\displaystyle\leq\sum_{\beta=0}^{T}P^{\tau^{(t,a,b)}}_{B|w}(\beta)\cdot 2^{\beta}\cdot 2^{-n}\cdot(1-2^{-r})^{-3(t+1)}.
  • If (w,a)(w,a) is not bad and |Ma,PX|v,wσ(t,a)|2r|\langle M_{a},P^{\sigma^{(t,a)}}_{X|v,w}\rangle|\leq 2^{-r}, we have PB|wτ(t,a,b)(β)=PB|wτ(t)(β)P^{\tau^{(t,a,b)}}_{B|w}(\beta)=P^{\tau^{(t)}}_{B|w}(\beta) for every β0\beta\geq 0. Then Eq. 13 and Eq. 15 imply that

    PX|v,wτ(t,a,b),P\displaystyle\langle P^{\tau^{(t,a,b)}}_{X|v,w},P\rangle β=0TPB|wτ(t)(β)2β2n(12r)3t3\displaystyle\leq\sum_{\beta=0}^{T}P^{\tau^{(t)}}_{B|w}(\beta)\cdot 2^{\beta}\cdot 2^{-n}\cdot(1-2^{-r})^{-3t-3}
    =β=0TPB|wτ(t,a,b)(β)2β2n(12r)3(t+1).\displaystyle=\sum_{\beta=0}^{T}P^{\tau^{(t,a,b)}}_{B|w}(\beta)\cdot 2^{\beta}\cdot 2^{-n}\cdot(1-2^{-r})^{-3(t+1)}.
  • If (w,a)(w,a) is not bad and |Ma,PX|v,wσ(t,a)|>2r|\langle M_{a},P^{\sigma^{(t,a)}}_{X|v,w}\rangle|>2^{-r}, by the definition of badness (11) we must have PX|v,wτ(t,a),P<122n\langle P^{\tau^{(t,a)}}_{X|v,w},P\rangle<\frac{1}{2}\cdot 2^{-n}. Thus by Eq. 14,

    PX|v,wτ(t,a,b),P<(12r)12nβ=0TPB|wτ(t,a,b)(β)2β2n(12r)3(t+1).\langle P^{\tau^{(t,a,b)}}_{X|v,w},P\rangle<(1-2^{-r})^{-1}\cdot 2^{-n}\leq\sum_{\beta=0}^{T}P^{\tau^{(t,a,b)}}_{B|w}(\beta)\cdot 2^{\beta}\cdot 2^{-n}\cdot(1-2^{-r})^{-3(t+1)}.

The last inequality follows from β=0TPB|wτ(t,a,b)(β)2β2n(12r)3(t+1)2n(12r)3(t+1)\sum_{\beta=0}^{T}P^{\tau^{(t,a,b)}}_{B|w}(\beta)\cdot 2^{\beta}\cdot 2^{-n}\cdot(1-2^{-r})^{-3(t+1)}\geq 2^{-n}(1-2^{-r})^{-3(t+1)}. Hence we obtain the same conclusion from all three cases.
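In the first case above (when (w,a) is bad), the reindexing behind the second displayed inequality can be spelled out as follows, using P^{τ(t,a,b)}_{B|w}(β) = P^{τ(t)}_{B|w}(β−1) for β > 0, P^{τ(t)}_{B|w}(T) = 0, and (1−2^{−r})^{−3t−2} ≤ (1−2^{−r})^{−3(t+1)}:

\sum_{\beta=0}^{T-1}P^{\tau^{(t)}}_{B|w}(\beta)\cdot 2^{\beta+1}=\sum_{\beta=1}^{T}P^{\tau^{(t)}}_{B|w}(\beta-1)\cdot 2^{\beta}=\sum_{\beta=1}^{T}P^{\tau^{(t,a,b)}}_{B|w}(\beta)\cdot 2^{\beta}\leq\sum_{\beta=0}^{T}P^{\tau^{(t,a,b)}}_{B|w}(\beta)\cdot 2^{\beta}.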

For the evolution step, since BB is classical we can view XX and BB as a whole and apply 3.3 to PXB|v,wτ(t+1)P^{\tau^{(t+1)}}_{XB|v,w}, which asserts that PXB|v,wτ(t+1)P^{\tau^{(t+1)}}_{XB|v,w} is a convex combination of PXB|v,wτ(t,a,b)P^{\tau^{(t,a,b)}}_{XB|v^{\prime},w^{\prime}} for some a,b,wa,b,w^{\prime} and |v|v^{\prime}\rangle. (It should be noted that in τ(t+1)\tau^{(t+1)}, XX and BB are not independent, while in τ(t,a,b)\tau^{(t,a,b)} they are independent conditioned on v,wv^{\prime},w^{\prime}. Nevertheless, independence of XX and BB in τ(t+1)\tau^{(t+1)} is not needed or used here: the final inequality follows by taking the corresponding convex combination of all the inequalities above.) Then by linearity we conclude that

PX|v,wτ(t+1),Pβ=0TPB|wτ(t+1)(β)2β2n(12r)3(t+1).\langle P^{\tau^{(t+1)}}_{X|v,w},P\rangle\leq\sum_{\beta=0}^{T}P^{\tau^{(t+1)}}_{B|w}(\beta)\cdot 2^{\beta}\cdot 2^{-n}\cdot(1-2^{-r})^{-3(t+1)}.\qed
Lemma 5.4.

For every 0βtT0\leq\beta\leq t\leq T we have

β|τB(t)|β2kβ(tβ).\langle\beta|\tau^{(t)}_{B}|\beta\rangle\leq 2^{-k\beta}\binom{t}{\beta}.
Proof.

We prove it by induction on tt. For t=0t=0 the lemma holds as τB(0)=|00|B\tau^{(0)}_{B}=|0\rangle\langle 0|_{B}. Also notice that the lemma is trivially true for every tt when β=0\beta=0.

Now suppose the lemma holds for some tt. By definition we have

τB(t+1)=𝐄aR𝒜[τB(t,a,1)+τB(t,a,1)]=𝐄aR𝒜TrW[UaτWB(t,a)Ua].\tau^{(t+1)}_{B}=\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}[\tau^{(t,a,1)}_{B}+\tau^{(t,a,-1)}_{B}]=\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\mathrm{Tr}_{W}[U_{a}\tau^{(t,a)}_{WB}U_{a}^{\dagger}].

Therefore

β|τB(t+1)|β=w𝒲𝐄aR𝒜[w,β|UaτWB(t,a)Ua|w,β].\langle\beta|\tau^{(t+1)}_{B}|\beta\rangle=\sum_{w\in\mathcal{W}}\mathop{\mathbf{E}\mbox{}}\limits_{a\in_{R}\mathcal{A}}\left[\langle w,\beta|U_{a}\tau^{(t,a)}_{WB}U_{a}^{\dagger}|w,\beta\rangle\right].

By Lemma 5.2 we know that for every w𝒲w\in\mathcal{W}, the probability that (w,a)(w,a) is bad for aR𝒜a\in_{R}\mathcal{A} is at most 2k2^{-k}. In other words, for every β>0\beta>0,

Ua|w,β={|w,β, w.p. 12k|w,β1, w.p. 2kU_{a}^{\dagger}|w,\beta\rangle=\left\{\begin{array}[]{ll}|w,\beta\rangle,&\textrm{ w.p. }\geq 1-2^{-k}\\ |w,\beta-1\rangle,&\textrm{ w.p. }\leq 2^{-k}\end{array}\right.

where the probability is taken over the random choice of aa. It means that

β|τB(t+1)|β\displaystyle\langle\beta|\tau^{(t+1)}_{B}|\beta\rangle w𝒲w,β|τWB(t,a)|w,β+2kw𝒲w,β1|τWB(t,a)|w,β1\displaystyle\leq\sum_{w\in\mathcal{W}}\langle w,\beta|\tau^{(t,a)}_{WB}|w,\beta\rangle+2^{-k}\sum_{w\in\mathcal{W}}\langle w,\beta-1|\tau^{(t,a)}_{WB}|w,\beta-1\rangle
=β|τB(t,a)|β+2kβ1|τB(t,a)|β1.\displaystyle=\langle\beta|\tau^{(t,a)}_{B}|\beta\rangle+2^{-k}\cdot\langle\beta-1|\tau^{(t,a)}_{B}|\beta-1\rangle.

Notice that

τB(t,a)=w𝒲Tr[τXV|w(t,a)]DiagPB|wτ(t)w𝒲Tr[τXV|w(t)]DiagPB|wτ(t)=τB(t),\tau^{(t,a)}_{B}=\sum_{w\in\mathcal{W}}\mathrm{Tr}[\tau^{(t,a)}_{XV|w}]\cdot\operatorname{Diag}P^{\tau^{(t)}}_{B|w}\leq\sum_{w\in\mathcal{W}}\mathrm{Tr}[\tau^{(t)}_{XV|w}]\cdot\operatorname{Diag}P^{\tau^{(t)}}_{B|w}=\tau^{(t)}_{B},

and thus we conclude that

β|τB(t+1)|β\displaystyle\langle\beta|\tau^{(t+1)}_{B}|\beta\rangle β|τB(t)|β+2kβ1|τB(t)|β1\displaystyle\leq\langle\beta|\tau^{(t)}_{B}|\beta\rangle+2^{-k}\cdot\langle\beta-1|\tau^{(t)}_{B}|\beta-1\rangle
2kβ(tβ)+2k2k(β1)(tβ1)=2kβ(t+1β).\displaystyle\leq 2^{-k\beta}\binom{t}{\beta}+2^{-k}\cdot 2^{-k(\beta-1)}\binom{t}{\beta-1}=2^{-k\beta}\binom{t+1}{\beta}.\qed
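The last equality is Pascal's rule for binomial coefficients:

2^{-k\beta}\binom{t}{\beta}+2^{-k}\cdot 2^{-k(\beta-1)}\binom{t}{\beta-1}=2^{-k\beta}\left(\binom{t}{\beta}+\binom{t}{\beta-1}\right)=2^{-k\beta}\binom{t+1}{\beta}.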

With the lemmas above in hand, we can finally prove Lemma 4.5.

Proof for Lemma 4.5.

For the target distribution P=PX|v,wτ(t)P=P^{\tau^{(t)}}_{X|v,w}, which is an admissible choice since the assumption of Lemma 4.5 together with Lemma 5.1 gives 2^ℓ·2^{−n/2} < ‖P‖_2 ≤ 4·2^ℓ·2^{−n/2}, we have PX|v,wτ(t),P=P22>222n\langle P^{\tau^{(t)}}_{X|v,w},P\rangle=\big\|P\big\|_{2}^{2}>2^{2\ell}\cdot 2^{-n}, so by Lemma 5.3,

β=0TPB|wτ(t)(β)2β(12r)3t>22.\sum_{\beta=0}^{T}P^{\tau^{(t)}}_{B|w}(\beta)\cdot 2^{\beta}\cdot(1-2^{-r})^{-3t}>2^{2\ell}.

Since tT2r2t\leq T\leq 2^{r-2}, we have (12r)3t2(1-2^{-r})^{-3t}\leq 2, and thus

β=TPB|wτ(t)(β)2β>12(222β=012β)>2.\sum_{\beta=\ell}^{T}P^{\tau^{(t)}}_{B|w}(\beta)\cdot 2^{\beta}>\frac{1}{2}\left(2^{2\ell}-2\cdot\sum_{\beta=0}^{\ell-1}2^{\beta}\right)>2^{\ell}.
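Here the first inequality uses P^{τ(t)}_{B|w}(β) ≤ 1 to bound the discarded terms with β < ℓ, and the final bound is simple arithmetic, assuming ℓ ≥ 2, since Σ_{β=0}^{ℓ−1} 2^β = 2^ℓ − 1:

\frac{1}{2}\left(2^{2\ell}-2\cdot\sum_{\beta=0}^{\ell-1}2^{\beta}\right)=2^{2\ell-1}-2^{\ell}+1>2^{\ell}\quad\textrm{whenever }\ell\geq 2.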

On the other hand, for every β\beta\geq\ell, by Lemma 5.4,

Tr[τB|w(t)]PB|wτ(t)(β)β|τB(t)|β(2kt)β<2(kr)β,\mathrm{Tr}[\tau^{(t)}_{B|w}]\cdot P^{\tau^{(t)}}_{B|w}(\beta)\leq\langle\beta|\tau^{(t)}_{B}|\beta\rangle\leq(2^{-k}t)^{\beta}<2^{-(k-r)\beta},

and thus by Eq. 6,

Tr[τX|v,w(t)]Tr[τB|w(t)]<2β=T2(kr)β2β22(kr)22m24r.\mathrm{Tr}[\tau^{(t)}_{X|v,w}]\leq\mathrm{Tr}[\tau^{(t)}_{B|w}]<2^{-\ell}\sum_{\beta=\ell}^{T}2^{-(k-r)\beta}\cdot 2^{\beta}\leq 2\cdot 2^{-(k-r)\ell}\leq 2^{-2m}\cdot 2^{-4r}.\qed
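The geometric-series step above can be justified as follows, assuming k − r ≥ 2 (which holds here since k − r ≥ r by the parameter relations recalled in Appendix A and r is assumed sufficiently large):

2^{-\ell}\sum_{\beta=\ell}^{T}2^{-(k-r)\beta}\cdot 2^{\beta}=2^{-\ell}\sum_{\beta=\ell}^{T}2^{-(k-r-1)\beta}\leq 2^{-\ell}\cdot 2\cdot 2^{-(k-r-1)\ell}=2\cdot 2^{-(k-r)\ell}.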

Acknowledgement

We are grateful to Uma Girish for many important discussions and suggestions on the draft of this paper, and to the anonymous reviewers for their helpful comments.

References

  • [AAB+19] Frank Arute, Kunal Arya, Ryan Babbush, Dave Bacon, Joseph C Bardin, Rami Barends, Rupak Biswas, Sergio Boixo, Fernando GSL Brandao, David A Buell, et al. Quantum supremacy using a programmable superconducting processor. Nature, 574(7779):505–510, 2019.
  • [Aar18] Scott Aaronson. Shadow tomography of quantum states. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 325–338, 2018.
  • [ACQ22] Dorit Aharonov, Jordan Cotler, and Xiao-Liang Qi. Quantum algorithmic measurement. Nature communications, 13(1):1–9, 2022.
  • [ADR02] Yonatan Aumann, Yan Zong Ding, and Michael O Rabin. Everlasting security in the bounded storage model. IEEE Transactions on Information Theory, 48(6):1668–1680, 2002.
  • [AR99] Yonatan Aumann and Michael O Rabin. Information theoretically secure communication in the limited storage space model. In Annual International Cryptology Conference, pages 65–79. Springer, 1999.
  • [Aud07] Koenraad MR Audenaert. A sharp continuity estimate for the von Neumann entropy. Journal of Physics A: Mathematical and Theoretical, 40(28):8127, 2007.
  • [BCL20] Sebastien Bubeck, Sitan Chen, and Jerry Li. Entanglement is necessary for optimal quantum property testing. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 692–703. IEEE, 2020.
  • [BGY18] Paul Beame, Shayan Oveis Gharan, and Xin Yang. Time-space tradeoffs for learning finite functions from random evaluations, with applications to polynomials. In COLT, volume 75 of Proceedings of Machine Learning Research, pages 843–856. PMLR, 2018.
  • [BY21] Anne Broadbent and Peter Yuen. Device-independent oblivious transfer from the bounded-quantum-storage-model and computational assumptions. arXiv preprint arXiv:2111.08595, 2021.
  • [CCHL22] Sitan Chen, Jordan Cotler, Hsin-Yuan Huang, and Jerry Li. Exponential separations between learning with and without quantum memory. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 574–585. IEEE, 2022.
  • [CLO22] Sitan Chen, Jerry Li, and Ryan O’Donnell. Toward instance-optimal state certification with incoherent measurements. In Conference on Learning Theory, pages 2541–2596. PMLR, 2022.
  • [CM97] Christian Cachin and Ueli Maurer. Unconditional security against memory-bounded adversaries. In Annual International Cryptology Conference, pages 292–306. Springer, 1997.
  • [CW01] Anthony Carbery and James Wright. Distributional and LqL^{q} norm inequalities for polynomials over convex bodies in n\mathbb{R}^{n}. Mathematical Research Letters, 8(3):233–248, 2001.
  • [DFR+07] Ivan B Damgård, Serge Fehr, Renato Renner, Louis Salvail, and Christian Schaffner. A tight high-order entropic quantum uncertainty relation with applications. In Annual International Cryptology Conference, pages 360–378. Springer, 2007.
  • [DFSS07] Ivan B Damgård, Serge Fehr, Louis Salvail, and Christian Schaffner. Secure identification and QKD in the bounded-quantum-storage model. In Annual International Cryptology Conference, pages 342–359. Springer, 2007.
  • [DFSS08] Ivan B Damgård, Serge Fehr, Louis Salvail, and Christian Schaffner. Cryptography in the bounded-quantum-storage model. SIAM Journal on Computing, 37(6):1865–1890, 2008.
  • [DM02] Stefan Dziembowski and Ueli Maurer. Tight security proofs for the bounded-storage model. In Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, pages 341–350, 2002.
  • [DM04] Stefan Dziembowski and Ueli Maurer. On generating the initial key in the bounded-storage model. In International Conference on the Theory and Applications of Cryptographic Techniques, pages 126–137. Springer, 2004.
  • [DQW21] Yevgeniy Dodis, Willy Quach, and Daniel Wichs. Speak much, remember little: Cryptography in the bounded storage model, revisited. Cryptology ePrint Archive, 2021.
  • [DQW22] Yevgeniy Dodis, Willy Quach, and Daniel Wichs. Authentication in the bounded storage model. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pages 737–766. Springer, 2022.
  • [DR02] Yan Zong Ding and Michael O Rabin. Hyper-encryption and everlasting security. In Annual Symposium on Theoretical Aspects of Computer Science, pages 1–26. Springer, 2002.
  • [FvdG99] Christopher A Fuchs and Jeroen van de Graaf. Cryptographic distinguishability measures for quantum-mechanical states. IEEE Transactions on Information Theory, 45(4):1216–1227, 1999.
  • [GKK+08] Dmitry Gavinsky, Julia Kempe, Iordanis Kerenidis, Ran Raz, and Ronald de Wolf. Exponential separation for one-way quantum communication complexity, with applications to cryptography. SIAM J. Comput., 38(5):1695–1708, 2008.
  • [GKLR21] Sumegha Garg, Pravesh K. Kothari, Pengda Liu, and Ran Raz. Memory-sample lower bounds for learning parity with noise. In APPROX-RANDOM, volume 207 of LIPIcs, pages 60:1–60:19. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021.
  • [GKR20] Sumegha Garg, Pravesh K. Kothari, and Ran Raz. Time-space tradeoffs for distinguishing distributions and applications to security of goldreich’s PRG. In APPROX-RANDOM, volume 176 of LIPIcs, pages 21:1–21:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020.
  • [GRT18] Sumegha Garg, Ran Raz, and Avishay Tal. Extractor-based time-space lower bounds for learning. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, page 990–1002, New York, NY, USA, 2018. Association for Computing Machinery.
  • [GZ19] Jiaxin Guan and Mark Zhandry. Simple schemes in the bounded storage model. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pages 500–524. Springer, 2019.
  • [HHJ+16] Jeongwan Haah, Aram W Harrow, Zhengfeng Ji, Xiaodi Wu, and Nengkun Yu. Sample-optimal tomography of quantum states. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 913–925, 2016.
  • [HKP20] Hsin-Yuan Huang, Richard Kueng, and John Preskill. Predicting many properties of a quantum system from very few measurements. Nature Physics, 16(10):1050–1057, 2020.
  • [HN06] Danny Harnik and Moni Naor. On everlasting security in the hybrid bounded storage model. In International Colloquium on Automata, Languages, and Programming, pages 192–203. Springer, 2006.
  • [KK12] Roy Kasher and Julia Kempe. Two-source extractors secure against quantum adversaries. Theory of Computing, 8(1):461–486, 2012.
  • [KRT17] Gillat Kol, Ran Raz, and Avishay Tal. Time-space hardness of learning sparse parities. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1067–1080, 2017.
  • [LM00] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5), October 2000.
  • [Lu02] Chi-Jen Lu. Hyper-encryption against space-bounded adversaries from on-line strong extractors. In Annual International Cryptology Conference, pages 257–271. Springer, 2002.
  • [LV21] Jiahui Liu and Satyanarayana Vusirikala. Secure multiparty computation in the bounded storage model. In IMA International Conference on Cryptography and Coding, pages 289–325. Springer, 2021.
  • [Mau92] Ueli M Maurer. Conditionally-perfect secrecy and a provably-secure randomized cipher. Journal of Cryptology, 5(1):53–66, 1992.
  • [MM18] Dana Moshkovitz and Michal Moshkovitz. Entropy samplers and strong generic lower bounds for space bounded learning. In ITCS, volume 94 of LIPIcs, pages 28:1–28:20. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2018.
  • [MSTS04] Tal Moran, Ronen Shaltiel, and Amnon Ta-Shma. Non-interactive timestamping in the bounded storage model. In Annual International Cryptology Conference, pages 460–476. Springer, 2004.
  • [OP04] Masanori Ohya and Dénes Petz. Quantum entropy and its use. Springer Science & Business Media, 2004.
  • [PMLA13] Stefano Pironio, Ll Masanes, Anthony Leverrier, and Antonio Acín. Security of device-independent quantum key distribution in the bounded-quantum-storage model. Physical Review X, 3(3):031007, 2013.
  • [Raz17] Ran Raz. A time-space lower bound for a large class of learning problems. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 732–742, 2017.
  • [Raz18] Ran Raz. Fast learning requires good memory: A time-space lower bound for parity learning. J. ACM, 66(1), December 2018.
  • [RS08] Alvin C Rencher and G Bruce Schaalje. Linear models in statistics. John Wiley & Sons, 2008.
  • [Sch07] Christian Schaffner. Cryptography in the bounded-quantum-storage model. arXiv preprint arXiv:0709.0289, 2007.
  • [Sha14] Ohad Shamir. Fundamental limits of online and distributed algorithms for statistical learning and estimation. Advances in Neural Information Processing Systems, 27, 2014.
  • [SSV19] Vatsal Sharan, Aaron Sidford, and Gregory Valiant. Memory-sample tradeoffs for linear regression with small error. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pages 890–901, 2019.
  • [SVW16] Jacob Steinhardt, Gregory Valiant, and Stefan Wager. Memory, communication, and statistical queries. In Conference on Learning Theory, pages 1490–1516. PMLR, 2016.
  • [Uhl76] Armin Uhlmann. The “transition probability” in the state space of a *-algebra. Reports on Mathematical Physics, 9(2):273–279, 1976.
  • [Wri16] John Wright. How to learn a quantum state. PhD thesis, Carnegie Mellon University, 2016.
  • [WW08] Stephanie Wehner and Jürg Wullschleger. Composable security in the bounded-quantum-storage model. In International Colloquium on Automata, Languages, and Programming, pages 604–615. Springer, 2008.

Appendix A Bounding Parameters for Theorem 2

Proof.

For Equation 4, since rr/4r\leq r^{\prime}/4 by the definition of rr and qr7q\leq r-7, we have:

q+r+1rq+r+14r2r62r.\displaystyle q+r+1-r^{\prime}\leq q+r+1-4r\leq-2r-6\leq-2r.

For Equation 5, we use the fact that =15(13r2)\ell=\frac{1}{5}(\ell^{\prime}-13r-2), therefore

2+9rn=\displaystyle 2\ell+9r-n= 25(13r2)+9rn\displaystyle\frac{2}{5}(\ell^{\prime}-13r-2)+9r-n
=\displaystyle= 25+195r45n.\displaystyle\frac{2}{5}\ell^{\prime}+\frac{19}{5}r-\frac{4}{5}-n.

Thus 2\ell+9r-n\leq-r whenever r\leq\frac{1}{8}\ell^{\prime}+\frac{1}{6} (using the fact that \ell^{\prime}\leq n), and the latter holds since r\leq\frac{1}{26}\ell^{\prime}+\frac{1}{6} by definition.
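Explicitly, the inequality \frac{2}{5}\ell^{\prime}+\frac{19}{5}r-\frac{4}{5}-n\leq-r rearranges to

\frac{24}{5}r\leq n-\frac{2}{5}\ell^{\prime}+\frac{4}{5},

and since \ell^{\prime}\leq n it suffices that \frac{24}{5}r\leq\frac{3}{5}\ell^{\prime}+\frac{4}{5}, that is, r\leq\frac{1}{8}\ell^{\prime}+\frac{1}{6}.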

For Equation 6, as r126+16r\leq\frac{1}{26}\ell^{\prime}+\frac{1}{6}, r12(k1)r\leq\frac{1}{2}(k^{\prime}-1) and k=k1k=k^{\prime}-1,

kr\displaystyle k-r k(k1)/2=(k1)/2,\displaystyle\geq k-(k^{\prime}-1)/2=(k^{\prime}-1)/2,
\displaystyle\ell =15(13r2)15(21362)>1101.\displaystyle=\frac{1}{5}(\ell^{\prime}-13r-2)\geq\frac{1}{5}\left(\ell^{\prime}-\frac{\ell^{\prime}}{2}-\frac{13}{6}-2\right)>\frac{1}{10}\ell^{\prime}-1.

When \ell^{\prime} is sufficiently large (concretely, \ell^{\prime}>660), we have \frac{1}{10}\ell^{\prime}-1>\frac{1}{11}\ell^{\prime}+5. Therefore, using the fact that m\leq(k^{\prime}-1)\ell^{\prime}/44:

(kr)\displaystyle(k-r)\ell 12(k1)(111+5)\displaystyle\geq\frac{1}{2}(k^{\prime}-1)\left(\frac{1}{11}\ell^{\prime}+5\right)
=122(k1)+52(k1)\displaystyle=\frac{1}{22}(k^{\prime}-1)\ell^{\prime}+\frac{5}{2}(k^{\prime}-1)
2m+5r\displaystyle\geq 2m+5r
2m+4r+1.\displaystyle\geq 2m+4r+1.\qed

Appendix B Proof for Proposition 4.2

We first state a variant of the Fuchs-van de Graaf inequality [FvdG99] on fidelity, defined for two partial density operators \rho and \sigma as

F(\rho,\sigma)=\mathrm{Tr}\left[\sqrt{\sqrt{\rho}\sigma\sqrt{\rho}}\right]^{2}.
Lemma B.1.

Let \rho,\sigma be two positive semi-definite operators. Assume \mathrm{Tr}[\rho]\geq\mathrm{Tr}[\sigma]. Then

12ρσTr14(Tr[ρ]+Tr[σ])2F(ρ,σ)Tr[ρ]2F(ρ,σ).\displaystyle\frac{1}{2}\big{\|}\rho-\sigma\big{\|}_{\mathrm{Tr}}\leq\sqrt{\frac{1}{4}(\mathrm{Tr}[\rho]+\mathrm{Tr}[\sigma])^{2}-F(\rho,\sigma)}\leq\sqrt{\mathrm{Tr}[\rho]^{2}-F(\rho,\sigma)}.

Notice that when Tr[ρ]=Tr[σ]=1\mathrm{Tr}[\rho]=\mathrm{Tr}[\sigma]=1, the above inequality is the original Fuchs-van de Graaf inequality.
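In that case, since \frac{1}{4}(\mathrm{Tr}[\rho]+\mathrm{Tr}[\sigma])^{2}=1, the bound specializes to

\frac{1}{2}\big\|\rho-\sigma\big\|_{\mathrm{Tr}}\leq\sqrt{1-F(\rho,\sigma)}.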

Proof.

Let uu and vv be purifications of ρ\rho and σ\sigma, that is, u,vAu,v\in\mathcal{H}\otimes\mathcal{H}_{A} with ρ=TrA[uu]\rho=\mathrm{Tr}_{A}[uu^{\dagger}] and σ=TrA[vv]\sigma=\mathrm{Tr}_{A}[vv^{\dagger}], where \mathcal{H} is the ambient space of ρ\rho and σ\sigma, and A\mathcal{H}_{A} is some finite-dimensional Hilbert space.

Let U be a unitary that diagonalizes uu^{\dagger}-vv^{\dagger}; that is, there is a diagonal matrix \Lambda\in\mathbb{C}^{d\times d}, where d is the dimension of \mathcal{H}\otimes\mathcal{H}_{A}, such that uu^{\dagger}-vv^{\dagger}=U\Lambda U^{\dagger}. Let p,q\in\mathbb{R}^{d}_{\geq 0} be the diagonals of U^{\dagger}uu^{\dagger}U and U^{\dagger}vv^{\dagger}U respectively. We have

uu=Tr[UuuU]\displaystyle u^{\dagger}u=\mathrm{Tr}[U^{\dagger}uu^{\dagger}U] =x[d]p(x),\displaystyle=\sum_{x\in[d]}p(x),
vv=Tr[UvvU]\displaystyle v^{\dagger}v=\mathrm{Tr}[U^{\dagger}vv^{\dagger}U] =x[d]q(x),\displaystyle=\sum_{x\in[d]}q(x),
uuvvTr=ΛTr\displaystyle\big{\|}uu^{\dagger}-vv^{\dagger}\big{\|}_{\mathrm{Tr}}=\big{\|}\Lambda\big{\|}_{\mathrm{Tr}} =x[d]|p(x)q(x)|,\displaystyle=\sum_{x\in[d]}|p(x)-q(x)|,
|u,v|=|Uu,Uv|\displaystyle|\langle u,v\rangle|=|\langle U^{\dagger}u,U^{\dagger}v\rangle| x[d]p(x)q(x).\displaystyle\leq\sum_{x\in[d]}\sqrt{p(x)q(x)}.
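Here the third and fourth relations hold because U^{\dagger}uu^{\dagger}U=(U^{\dagger}u)(U^{\dagger}u)^{\dagger} has diagonal entries p(x)=|(U^{\dagger}u)(x)|^{2} (and similarly for q), while U^{\dagger}(uu^{\dagger}-vv^{\dagger})U=\Lambda is diagonal with entries p(x)-q(x); in particular,

|\langle U^{\dagger}u,U^{\dagger}v\rangle|=\left|\sum_{x\in[d]}\overline{(U^{\dagger}u)(x)}\cdot(U^{\dagger}v)(x)\right|\leq\sum_{x\in[d]}\sqrt{p(x)q(x)}.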

Therefore, by Cauchy-Schwarz inequality,

uuvvTr2\displaystyle\big{\|}uu^{\dagger}-vv^{\dagger}\big{\|}^{2}_{\mathrm{Tr}} =(x[d]|p(x)q(x)|)2\displaystyle=\left(\sum_{x\in[d]}|p(x)-q(x)|\right)^{2}
=(x[d]|p(x)q(x)||p(x)+q(x)|)2\displaystyle=\left(\sum_{x\in[d]}\left|\sqrt{p(x)}-\sqrt{q(x)}\right|\cdot\left|\sqrt{p(x)}+\sqrt{q(x)}\right|\right)^{2}
(x[d]|p(x)q(x)|2)(x[d]|p(x)+q(x)|2)\displaystyle\leq\left(\sum_{x\in[d]}\left|\sqrt{p(x)}-\sqrt{q(x)}\right|^{2}\right)\left(\sum_{x\in[d]}\left|\sqrt{p(x)}+\sqrt{q(x)}\right|^{2}\right)
=(x[d]p(x)+x[d]q(x))24(x[d]p(x)q(x))2\displaystyle=\left(\sum_{x\in[d]}p(x)+\sum_{x\in[d]}q(x)\right)^{2}-4\left(\sum_{x\in[d]}\sqrt{p(x)q(x)}\right)^{2}
(uu+vv)24|u,v|2.\displaystyle\leq\left(u^{\dagger}u+v^{\dagger}v\right)^{2}-4|\langle u,v\rangle|^{2}.

Notice that ρσTruuvvTr\big{\|}\rho-\sigma\big{\|}_{\mathrm{Tr}}\leq\big{\|}uu^{\dagger}-vv^{\dagger}\big{\|}_{\mathrm{Tr}}, Tr[ρ]=uu\mathrm{Tr}[\rho]=u^{\dagger}u and Tr[σ]=vv\mathrm{Tr}[\sigma]=v^{\dagger}v. By Uhlmann’s theorem [Uhl76], we can also choose uu and vv such that F(ρ,σ)=|u,v|2F(\rho,\sigma)=|\langle u,v\rangle|^{2}. Plugging them into the above inequality concludes the proof. ∎

Now we are ready to prove Proposition 4.2.

Proof for Proposition 4.2.

By Lemma B.1 (our variant of the Fuchs-van de Graaf inequality), it suffices to prove the following bound on fidelity:

F(ρ,ΠρΠ)Tr[Πρ]2.F(\rho,\Pi\rho\Pi)\geq\mathrm{Tr}[\Pi\rho]^{2}.

Let u be a purification of \rho, that is, \rho=\mathrm{Tr}_{A}[uu^{\dagger}] for some Hilbert space \mathcal{H}_{A}. Then \big(\Pi\otimes\mathbb{I}_{A}\big)u is a purification of \Pi\rho\Pi. By Uhlmann’s theorem we have

F(ρ,ΠρΠ)|u(Π𝕀A)u|2=Tr[(Π𝕀A)uu]2=Tr[ΠTrA[uu]]2=Tr[Πρ]2.F(\rho,\Pi\rho\Pi)\geq\left|u^{\dagger}\big{(}\Pi\otimes\mathbb{I}_{A}\big{)}u\right|^{2}=\mathrm{Tr}\left[\big{(}\Pi\otimes\mathbb{I}_{A}\big{)}uu^{\dagger}\right]^{2}=\mathrm{Tr}\left[\Pi\cdot\mathrm{Tr}_{A}[uu^{\dagger}]\right]^{2}=\mathrm{Tr}[\Pi\rho]^{2}.\qed
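In particular, since \mathrm{Tr}[\Pi\rho\Pi]=\mathrm{Tr}[\Pi\rho]\leq\mathrm{Tr}[\rho], applying Lemma B.1 with \sigma=\Pi\rho\Pi together with this fidelity bound yields the corresponding trace-distance bound

\frac{1}{2}\big\|\rho-\Pi\rho\Pi\big\|_{\mathrm{Tr}}\leq\sqrt{\mathrm{Tr}[\rho]^{2}-\mathrm{Tr}[\Pi\rho]^{2}}.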

Appendix C Linear Quantum Memory Lower Bound

In this appendix, we prove Theorem 3, which gives a simpler proof of a linear quantum-memory lower bound (without classical memory). Theorem 3 is qualitatively weaker than our main results in most cases, since it only applies to programs with a linear-size quantum memory and no (possibly quadratic-size) classical memory. Nevertheless, it is technically incomparable to the main results: it is stated in terms of quantum extractors, and the bound on the quantum-memory size depends on a different parameter of the extractor. Additionally, the proof of Theorem 3 is significantly simpler than the proof of our main theorem.

We first define the quantum extractor property that we need, which is a simplified version of the ones considered in [KK12]. Given a matrix M:𝒜×𝒳{1,1}M:\mathcal{A}\times\mathcal{X}\rightarrow\{-1,1\}, consider two independent sources AA and XX uniformly distributed over 𝒜\mathcal{A} and 𝒳\mathcal{X} respectively. Suppose there is some quantum register VV whose state depends on AA and XX, and they together form a classical-quantum system

ρAXV=a𝒜,x𝒳ρV|a,x,\rho_{AXV}=\bigoplus_{a\in\mathcal{A},x\in\mathcal{X}}\rho_{V|a,x},

where ρV|a,x\rho_{V|a,x} is the state of VV when A=aA=a and X=xX=x. For any function ff on A×XA\times X, we say that VV depends only on f(A,X)f(A,X) if for any a,a𝒜a,a^{\prime}\in\mathcal{A} and x,x𝒳x,x^{\prime}\in\mathcal{X}, whenever f(a,x)=f(a,x)f(a,x)=f(a^{\prime},x^{\prime}) we have ρV|a,x=ρV|a,x\rho_{V|a,x}=\rho_{V|a^{\prime},x^{\prime}}. In particular, VV depending only on AA is equivalent to VV being independent of XX, or ρXV=ρXρV\rho_{XV}=\rho_{X}\otimes\rho_{V}.

We say that MM is an XX-strong (q,r)(q,r)-quantum extractor, if for every classical-quantum system ρAXV\rho_{AXV}, as above, with the qq-qubit quantum subsystem VV that depends only on AA, it holds that

ρM(A,X)XVUρXρVTr2r.\left\|\rho_{M(A,X)XV}-U\otimes\rho_{X}\otimes\rho_{V}\right\|_{\mathrm{Tr}}\leq 2^{-r}.

Here U=(1/2001/2)U=\begin{pmatrix}1/2&0\\ 0&1/2\end{pmatrix} is the uniform operator over one bit, and ρM(A,X)XV\rho_{M(A,X)XV} is the classical-quantum system constructed by adding a new classical register storing the value of M(A,X)M(A,X) and tracing out AA. In other words,

ρM(A,X)XV=b{1,1},x𝒳a𝒜M(a,x)=bρV|a,x.\rho_{M(A,X)XV}=\bigoplus_{b\in\{-1,1\},x\in\mathcal{X}}\sum_{\begin{subarray}{c}a\in\mathcal{A}\\ M(a,x)=b\end{subarray}}\rho_{V|a,x}.

Notice that if we choose VV to be trivial, the above inequality immediately implies that |𝐄[M(A,X)]|2r\left|\mathop{\mathbf{E}\mbox{}}\limits[M(A,X)]\right|\leq 2^{-r}.
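Indeed, since tracing out registers can only decrease the trace distance, taking V to be trivial in the extractor condition gives \big\|\rho_{M(A,X)}-U\big\|_{\mathrm{Tr}}\leq 2^{-r}, and for the single classical bit M(A,X),

\big\|\rho_{M(A,X)}-U\big\|_{\mathrm{Tr}}=\left|\Pr[M(A,X)=1]-\tfrac{1}{2}\right|+\left|\Pr[M(A,X)=-1]-\tfrac{1}{2}\right|=\left|\mathop{\mathbf{E}}[M(A,X)]\right|.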

As an example, the results in [KK12] imply that the inner product function on nn bits, where 𝒜=𝒳=𝔽2n\mathcal{A}=\mathcal{X}=\mathbb{F}_{2}^{n} and

M(a,x)=(1)ax,M(a,x)=(-1)^{a\cdot x},

is an XX-strong (k,nk)(k,n-k)-quantum extractor for every 2kn2\leq k\leq n.

In this section we prove the following theorem:

Theorem 3.

Let \mathcal{X},\mathcal{A} be two finite sets with n=\log_{2}|\mathcal{X}|. Let M:\mathcal{A}\times\mathcal{X}\rightarrow\{-1,1\} be a matrix which is an X-strong (q,r)-quantum extractor. Let \rho be a branching program for the learning problem corresponding to M, described by classical-quantum systems \rho^{(t)}_{XV}, with q/2-qubit quantum memory V and length T, and without classical memory. Then the success probability of \rho is at most

2n+8Tn+q2r/4.2^{-n}+8T\sqrt{n+q}\cdot 2^{-r/4}.
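As an illustration, combining Theorem 3 with the inner product example above (an X-strong (k,n-k)-quantum extractor) and taking k=n/2, say for even n\geq 4, shows that any branching program for the corresponding learning problem (parity learning on n bits) with n/4 qubits of quantum memory and no classical memory has success probability at most

2^{-n}+8T\sqrt{3n/2}\cdot 2^{-n/8},

and therefore needs T=2^{\Omega(n)} samples to succeed with any constant probability.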

To prove the theorem, we first need to define the following measure of dependency:

Definition.

Let ρXV\rho_{XV} be a classical-quantum system over classical XX and quantum VV. The dependency of VV on XX in ρXV\rho_{XV} is defined as

ξρ(X;V)=minτVρXVρXτVTr\xi^{\rho}(X;V)=\min_{\tau_{V}}\big{\|}\rho_{XV}-\rho_{X}\otimes\tau_{V}\big{\|}_{\mathrm{Tr}}

where τV\tau_{V} is taken over all density operators on VV. Notice that in this definition taking τV=ρV\tau_{V}=\rho_{V} is almost optimal as we have

ρXVρXρVTrρXVρXτVTr+ρVτVTr2ρXVρXτVTr.\big{\|}\rho_{XV}-\rho_{X}\otimes\rho_{V}\big{\|}_{\mathrm{Tr}}\leq\big{\|}\rho_{XV}-\rho_{X}\otimes\tau_{V}\big{\|}_{\mathrm{Tr}}+\big{\|}\rho_{V}-\tau_{V}\big{\|}_{\mathrm{Tr}}\leq 2\big{\|}\rho_{XV}-\rho_{X}\otimes\tau_{V}\big{\|}_{\mathrm{Tr}}. (16)

We also consider the quantum mutual information between XX and VV in ρXV\rho_{XV}, defined as

𝐈ρ(X;V)=𝐒(ρX)+𝐒(ρV)𝐒(ρXV)=𝐒(ρXVρXρV),\mathbf{I}_{\rho}\left(X;V\right)=\mathbf{S}(\rho_{X})+\mathbf{S}(\rho_{V})-\mathbf{S}(\rho_{XV})=\mathbf{S}\left(\rho_{XV}\mathrel{\|}\rho_{X}\otimes\rho_{V}\right),

where 𝐒()\mathbf{S}(\cdot) denotes the von Neumann entropy, and 𝐒()\mathbf{S}(\cdot\mathrel{\|}\cdot) denotes quantum relative entropy. When VV consists of qq qubits, we have the following relationship between our dependency measure and quantum mutual information:

Lemma C.1.

12ξρ(X;V)2𝐈ρ(X;V)qξρ(X;V)+2ξρ(X;V)\displaystyle\frac{1}{2}\xi^{\rho}(X;V)^{2}\leq\mathbf{I}_{\rho}\left(X;V\right)\leq q\cdot\xi^{\rho}(X;V)+2\sqrt{\xi^{\rho}(X;V)}.

Proof.

On one hand, using the inequality on quantum relative entropy and trace distance (see e.g. [OP04, Theorem 1.15]), we have

𝐈ρ(X;V)=𝐒(ρXVρXρV)12ρXVρXρVTr212ξρ(X;V)2.\mathbf{I}_{\rho}\left(X;V\right)=\mathbf{S}\left(\rho_{XV}\mathrel{\|}\rho_{X}\otimes\rho_{V}\right)\geq\frac{1}{2}\big{\|}\rho_{XV}-\rho_{X}\otimes\rho_{V}\big{\|}^{2}_{\mathrm{Tr}}\geq\frac{1}{2}\xi^{\rho}(X;V)^{2}.

On the other hand, the Fannes-Audenaert inequality [Aud07] tells us that the difference between the von Neumann entropies of any two states \rho and \tau on V is bounded by

|𝐒(ρ)𝐒(τ)|q12ρτTr+h(12ρτTr)|\mathbf{S}(\rho)-\mathbf{S}(\tau)|\leq q\cdot\frac{1}{2}\big{\|}\rho-\tau\big{\|}_{\mathrm{Tr}}+h\left(\frac{1}{2}\big{\|}\rho-\tau\big{\|}_{\mathrm{Tr}}\right)

where h(ϵ)=ϵlog2ϵ(1ϵ)log2(1ϵ)h(\epsilon)=-\epsilon\log_{2}\epsilon-(1-\epsilon)\log_{2}(1-\epsilon) is the binary entropy function. Since the state of VV conditioned on X=xX=x is ρV|x/Pr[X=x]=2nρV|x\rho_{V|x}/\Pr[X=x]=2^{n}\rho_{V|x}, we have

𝐈ρ(X;V)\displaystyle\mathbf{I}_{\rho}\left(X;V\right) =𝐄xX[𝐒(ρV)𝐒(2nρV|x)]\displaystyle=\mathop{\mathbf{E}\mbox{}}\limits_{x\sim X}\left[\mathbf{S}(\rho_{V})-\mathbf{S}(2^{n}\rho_{V|x})\right]
12q𝐄xXρV2nρV|xTr+𝐄xXh(12ρV2nρV|xTr)\displaystyle\leq\frac{1}{2}q\cdot\mathop{\mathbf{E}\mbox{}}\limits_{x\sim X}\big{\|}\rho_{V}-2^{n}\rho_{V|x}\big{\|}_{\mathrm{Tr}}+\mathop{\mathbf{E}\mbox{}}\limits_{x\sim X}h\left(\frac{1}{2}\big{\|}\rho_{V}-2^{n}\rho_{V|x}\big{\|}_{\mathrm{Tr}}\right)
12qρXVρXρVTr+h(12ρXVρXρVTr)\displaystyle\leq\frac{1}{2}q\cdot\big{\|}\rho_{XV}-\rho_{X}\otimes\rho_{V}\big{\|}_{\mathrm{Tr}}+h\left(\frac{1}{2}\big{\|}\rho_{XV}-\rho_{X}\otimes\rho_{V}\big{\|}_{\mathrm{Tr}}\right)
12qρXVρXρVTr+2ρXVρXρVTr,\displaystyle\leq\frac{1}{2}q\cdot\big{\|}\rho_{XV}-\rho_{X}\otimes\rho_{V}\big{\|}_{\mathrm{Tr}}+\sqrt{2\big{\|}\rho_{XV}-\rho_{X}\otimes\rho_{V}\big{\|}_{\mathrm{Tr}}},

as h is concave and h(\epsilon)\leq 2\sqrt{\epsilon}. Now let \tau_{V} be the optimal density operator in the definition of \xi^{\rho}(X;V). Plugging in Eq. 16, which gives \big\|\rho_{XV}-\rho_{X}\otimes\rho_{V}\big\|_{\mathrm{Tr}}\leq 2\xi^{\rho}(X;V), we conclude that

𝐈ρ(X;V)qξρ(X;V)+2ξρ(X;V).\mathbf{I}_{\rho}\left(X;V\right)\leq q\cdot\xi^{\rho}(X;V)+2\sqrt{\xi^{\rho}(X;V)}.\qed
Lemma C.2.

For every classical-quantum system ρAXV\rho_{AXV} with the qq-qubit quantum subsystem VV that depends only on AA, we have

\mathbf{I}_{\rho}\left(X;M(A,X),V\right)\leq 5(n+q)\cdot 2^{-r/2}.
Proof.

Since \mathbf{I}_{\rho}\left(X;V\right)=0, we have \mathbf{I}_{\rho}\left(X;M(A,X),V\right)=\mathbf{I}_{\rho}\left(X;M(A,X)\mid V\right)\leq\mathbf{I}_{\rho}\left(M(A,X);X,V\right), so it suffices to bound the latter. We first notice that since M is an X-strong (q,r)-quantum extractor,

ξρ(M(A,X);X,V)\displaystyle\xi^{\rho}(M(A,X);X,V) ρM(A,X)XVρM(A,X)ρXρVTr\displaystyle\leq\left\|\rho_{M(A,X)XV}-\rho_{M(A,X)}\otimes\rho_{X}\otimes\rho_{V}\right\|_{\mathrm{Tr}}
ρM(A,X)XVUρXρVTr+|𝐄[M(A,X)]|\displaystyle\leq\left\|\rho_{M(A,X)XV}-U\otimes\rho_{X}\otimes\rho_{V}\right\|_{\mathrm{Tr}}+\left|\mathop{\mathbf{E}\mbox{}}\limits[M(A,X)]\right|
22r.\displaystyle\leq 2\cdot 2^{-r}.

As the total dimension of XX and VV is 2n+q2^{n+q}, by Lemma C.1 we have

𝐈ρ(X;M(A,X),V)\displaystyle\mathbf{I}_{\rho}\left(X;M(A,X),V\right) 𝐈ρ(M(A,X);X,V)\displaystyle\leq\mathbf{I}_{\rho}\left(M(A,X);X,V\right)
(n+q)ξρ(M(A,X);X,V)+2ξρ(M(A,X);X,V)\displaystyle\leq(n+q)\cdot\xi^{\rho}(M(A,X);X,V)+2\sqrt{\xi^{\rho}(M(A,X);X,V)}
\displaystyle\leq 2(n+q)\cdot 2^{-r}+2\sqrt{2\cdot 2^{-r}}
\displaystyle\leq 5(n+q)\cdot 2^{-r/2}.\qed
Lemma C.3.

For every classical-quantum system ρAXV\rho_{AXV} with q/2q/2-qubit quantum subsystem VV that depends only on AA and M(A,X)M(A,X), we have

ξρ(X;V)4n+q2r/4.\xi^{\rho}(X;V)\leq 4\sqrt{n+q}\cdot 2^{-r/4}.
Proof.

Let W=\rho_{a,-1}\otimes\rho_{a,1}, where \rho_{a,b} is the density matrix of V when A=a and M(A,X)=b. Then W is a q-qubit quantum system that depends only on A. Since V can be recovered from M(A,X) and W (by keeping the half of W indexed by M(A,X)), we have

\xi^{\rho}(X;V)^{2}\leq 2\mathbf{I}_{\rho}\left(X;V\right)\leq 2\mathbf{I}_{\rho}\left(X;M(A,X),W\right)\leq 10(n+q)\cdot 2^{-r/2}.\qed

We are now ready to prove Theorem 3. Let Φt,a,b\Phi_{t,a,b} be the quantum channel applied on VV at stage tt with sample (a,b)(a,b), and recall that the evolution of the system ρXV(t)\rho^{(t)}_{XV} can be expressed as

ρXV(t+1)=𝐄aA[x𝒳|xx|Φt,a,M(a,x)(ρV|x(t))].\rho^{(t+1)}_{XV}=\mathop{\mathbf{E}\mbox{}}\limits_{a\sim A}\left[\sum_{x\in\mathcal{X}}|x\rangle\langle x|\otimes\Phi_{t,a,M(a,x)}\big{(}\rho^{(t)}_{V|x}\big{)}\right].
Proof of Theorem 3.

We are going to bound the increment of ξt\xi_{t}, which is the shorthand for ξρ(t)(X;V)\xi^{\rho^{(t)}}(X;V). For now let us focus on some stage tt, and let τ\tau be the density operator that minimizes ξt=ρXV(t)ρX(t)τTr\xi_{t}=\big{\|}\rho^{(t)}_{XV}-\rho^{(t)}_{X}\otimes\tau\big{\|}_{\mathrm{Tr}}. Notice that ρX(t)=ρX=2n𝕀X\rho^{(t)}_{X}=\rho_{X}=2^{-n}\mathbb{I}_{X} for every tt.

Since \tau is a fixed quantum state, we can prepare \tau and apply \Phi_{t,A,M(A,X)} to it to obtain a new quantum register V^{\prime}, which depends only on A and M(A,X). Notice that

\rho_{XV^{\prime}}=\mathop{\mathbf{E}}\limits_{a\sim A}\left[\sum_{x\in\mathcal{X}}2^{-n}|x\rangle\langle x|\otimes\Phi_{t,a,M(a,x)}(\tau)\right],

and therefore similarly to Lemma 4.14 (that evolution does not increase trace distance), we can show that

\displaystyle\left\|\rho^{(t+1)}_{XV}-\rho_{XV^{\prime}}\right\|_{\mathrm{Tr}}\leq\mathop{\mathbf{E}}\limits_{a\sim A}\sum_{x\in\mathcal{X}}\left\|\Phi_{t,a,M(a,x)}\big(\rho^{(t)}_{V|x}\big)-\Phi_{t,a,M(a,x)}\big(2^{-n}\tau\big)\right\|_{\mathrm{Tr}}
\displaystyle\leq\sum_{x\in\mathcal{X}}\big\|\rho^{(t)}_{V|x}-2^{-n}\tau\big\|_{\mathrm{Tr}}=\left\|\rho^{(t)}_{XV}-\rho_{X}\otimes\tau\right\|_{\mathrm{Tr}}=\xi_{t}.

Hence we have

ξt+1\displaystyle\xi_{t+1} ρXV(t+1)ρXρVTr\displaystyle\leq\left\|\rho^{(t+1)}_{XV}-\rho_{X}\otimes\rho_{V^{\prime}}\right\|_{\mathrm{Tr}}
ρXV(t+1)ρXVTr+ρXVρXρVTr\displaystyle\leq\left\|\rho^{(t+1)}_{XV}-\rho_{XV^{\prime}}\right\|_{\mathrm{Tr}}+\big{\|}\rho_{XV^{\prime}}-\rho_{X}\otimes\rho_{V^{\prime}}\big{\|}_{\mathrm{Tr}}
ξt+2ξρ(X;V)\displaystyle\leq\xi_{t}+2\xi^{\rho}(X;V^{\prime})
ξt+8n+q2r/4.\displaystyle\leq\xi_{t}+8\sqrt{n+q}\cdot 2^{-r/4}.

Since ξ0=0\xi_{0}=0, we conclude that

ξT8Tn+q2r/4.\xi_{T}\leq 8T\sqrt{n+q}\cdot 2^{-r/4}.

This value bounds the difference between the success probability of \rho and that of a quantum branching program whose memory is independent of X. The latter is clearly at most 2^{-n}, as such a program outputs a guess that is independent of the uniformly random X; this finishes the proof. ∎