A quantum parallel Markov chain Monte Carlo

Andrew J. Holbrook UCLA Biostatistics

Abstract

We propose a novel hybrid quantum computing strategy for parallel MCMC algorithms that generate multiple proposals at each step. This strategy makes the rate-limiting step within parallel MCMC amenable to quantum parallelization by using the Gumbel-max trick to turn the generalized accept-reject step into a discrete optimization problem. When combined with new insights from the parallel MCMC literature, such an approach allows us to embed target density evaluations within a well-known extension of Grover’s quantum search algorithm. Letting $P$ denote the number of proposals in a single MCMC iteration, the combined strategy reduces the number of target evaluations required from ${\cal O}\left(P\right)$ to ${\cal O}\left(P^{1/2}\right)$ . In the following, we review the rudiments of quantum computing, quantum search and the Gumbel-max trick in order to elucidate their combination for as wide a readership as possible.

1 Introduction

Parallel MCMC techniques use multiple proposals to obtain efficiency gains over MCMC algorithms such as Metropolis-Hastings (Metropolis et al., 1953; Hastings, 1970) and its progeny that use only a single proposal. Neal (2003) first develops efficient MCMC transitions for inferring the states of hidden Markov models by proposing a ‘pool’ of candidate states and using dynamic programming to select among them. Next, Tjelmeland (2004) considers inference in the general setting and shows how to maintain detailed balance for an arbitrary number $P$ of proposals. Consider a probability distribution $\pi(\mbox{d}\bm{\theta})$ defined on $\mathbb{R}^{D}$ that admits a probability density $\pi(\bm{\theta})$ with respect to the Lebesgue measure, i.e., $\pi(\mbox{d}\bm{\theta})=:\pi(\bm{\theta})\mbox{d}\bm{\theta}$ . To generate samples from the target distribution $\pi$ , we craft a kernel $P(\bm{\theta}_{0},\mbox{d}\bm{\theta})$ that satisfies

\displaystyle\pi(A)=\int\pi(\mbox{d}\bm{\theta}_{0})P(\bm{\theta}_{0},A)

(1)

for all $A\subset\mathbb{R}^{D}$ . Letting $\bm{\theta}_{0}$ denote the current state of the Markov chain, Tjelmeland (2004) proposes sampling from such a kernel $P(\bm{\theta}_{0},\cdot)$ by drawing $P$ proposals $\bm{\Theta}_{-0}=(\bm{\theta}_{1},\dots,\bm{\theta}_{P})$ from a joint distribution $Q(\bm{\theta}_{0},\mbox{d}\bm{\Theta}_{-0})=:q(\bm{\theta}_{0},\bm{\Theta}_{-0})\mbox{d}\bm{\Theta}_{-0}$ and selecting the next Markov chain state from among the current and proposed states with probabilities

\displaystyle\pi_{p}=\frac{\pi(\bm{\theta}_{p})q(\bm{\theta}_{p},\bm{\Theta}_{-p})}{\sum_{p^{\prime}=0}^{P}\pi(\bm{\theta}_{p^{\prime}})q(\bm{\theta}_{p^{\prime}},\bm{\Theta}_{-p^{\prime}})}\,,\quad p=0,\dots,P\,.

(2)

Here, if $\bm{\Theta}=(\bm{\theta}_{0},\dots,\bm{\theta}_{P})$ is the $D\times(P+1)$ matrix having the current state and all $P$ proposals as columns, then $\bm{\Theta}_{-p}$ is the $D\times P$ matrix obtained by removing the $p$ th column. Tjelmeland (2004) shows that the kernel $P(\bm{\theta}_{0},\cdot)$ constructed in such a manner maintains detailed balance and hence satisfies (1). Others have since built on this work, developing parallel MCMC methods that generate or recycle multiple proposals (Frenkel, 2004; Delmas and Jourdain, 2009; Calderhead, 2014; Yang et al., 2018; Luo and Tjelmeland, 2019; Schwedes and Calderhead, 2021; Holbrook, 2021). Most recently, Glatt-Holtz et al. (2022) place all these developments in their natural measure theoretic context, allowing one to trivially apply (1) and (2) to distributions over discrete-valued random variables.

Taken together, these developments demonstrate the ability of parallel MCMC methods to alleviate inferential challenges such as multimodality and to deliver performance gains over single-proposal competitors as measured by reduced autocorrelation between samples. In the following, we focus on the specification presented in Algorithm 1 but note that the techniques we present may also be effective for generalizations of this basic algorithm.

Data: Initial Markov chain state

\bm{\theta}^{(0)}

; total length of Markov chain

S

; total number of proposals per iteration

P

; routine for evaluating target density

\pi(\cdot)

; routines for drawing random samples from the proposal distribution

Q(\bm{\theta}^{(s)},\cdot)

and from a

P+1

discrete distribution

Discrete(\bm{\pi})

given some probability vector

\bm{\pi}=(\pi_{0},\dots,\pi_{P})

Result: A Markov chain

\bm{\theta}^{(1)},\dots,\bm{\theta}^{(S)}

for $s\in\{1,\dots,S\}$ do

\bm{\theta}_{0}\leftarrow\bm{\theta}^{(s-1)}

;

\bm{\Theta}_{-0}\leftarrow Q(\bm{\theta}_{0},\cdot)

;

for $p\in\{0,\dots,P\}$ do

\pi_{p}\leftarrow\pi(\bm{\theta}_{p})\,q(\bm{\theta}_{p},\bm{\Theta}_{-p})

;

end for

\hat{p}\leftarrow Discrete(\bm{\pi}/\bm{\pi}^{T}\mathbb{1})

;

\bm{\theta}^{(s)}\leftarrow\bm{\theta}_{\hat{p}}

;

end for

return $\bm{\theta}^{(1)},\dots,\bm{\theta}^{(S)}$ .

Algorithm 1 Parallel (multiproposal) MCMC (Tjelmeland, 2004)

One need not use parallel computing to implement Algorithm 1, but the real promise and power of parallel MCMC comes from its natural parallelizability (Calderhead, 2014). Contemporary hardware design emphasizes architectures that enable execution of multiple mathematical operations simultaneously. Parallel MCMC techniques stand to leverage technological developments that keep modern computation on track with Moore’s Law, which predicts that processing power doubles every two years. For example, the algorithm of Tjelmeland (2004) generates $P$ conditionally independent proposals and then evaluates the probabilities of (2). One may parallelize the proposal generation step using parallel pseudorandom number generators (PRNG) such as those advanced in Salmon et al. (2011). The computational complexity of the target evaluations $\pi(\bm{\theta}_{p})$ is linear in the number of proposals. This presents a significant burden when $\pi(\cdot)$ is computationally expensive, e.g., in big data settings or for Bayesian inversion, but evaluation of the target density over the $P$ proposals is again a naturally parallelizable task. Moreover, widely available machine learning software such as TensorFlow allows users to easily parallelize both random number generation and target evaluations on general purpose graphics processing units (GPU) (Lao et al., 2020). Finally, when generating independent proposals using a proposal distribution of the form $q(\bm{\theta}_{0},\bm{\Theta}_{-0})=\prod_{p=1}^{P}q(\bm{\theta}_{0},\bm{\theta}_{p})$ , the acceptance probabilities (2) require the ${\cal O}\left(P^{2}\right)$ evaluation of the $P+1\choose 2$ terms $q(\bm{\theta}_{p},\bm{\theta}_{p^{\prime}})$ , but Holbrook et al. (2021a, b) demonstrate the natural parallelizability of such pairwise operations, obtaining orders-of-magnitude speedups with contemporary GPUs. The proposed method directly addresses the acceptance step of Algorithm 1, while leaving the choice of parallelizing (or not parallelizing) the proposal step to the practitioner.

While parallel MCMC algorithms are increasingly well-suited for developing many-core computational architectures, there are trade-offs that need to be taken into account when choosing how to allocate computational resources. On one end of the spectrum, Gelman and Rubin (1992a, b) demonstrate the usefulness of generating, combining and comparing multiple independent Markov chains that target the same distribution, and one may perform this embarrassingly parallel task by assigning the operations for each individual chain to a separate central processing unit (CPU) core or GPU work-group. In this multi-chain context, simultaneously running multiple parallel MCMC chains could limit resources available for the within-chain parallelization described in the previous paragraph. On the other end of the spectrum, one may find it useful to allocate resources to overcome computational bottlenecks within a single Markov chain that uses a traditional accept-reject step. In big data contexts, Holbrook et al. (2021a, b, 2022a, 2022b) use multi- and many-core processing to accelerate log-likelihood and log-likelihood gradient calculations within single Metropolis-Hastings and Hamiltonian Monte Carlo (Neal, 2011) generated chains. This big data strategy might again limit resources available for parallelization across proposals and target evaluations described above.

In the presence of these trade-offs, it is worthwhile to develop additional parallel computing tools for parallel MCMC so that future scientists may be better able to flexibly assign sub-routines to different computing resources in a way that is tailored to their specific inference task. In particular, we show that quantum parallelization can be a useful tool for parallel MCMC when evaluation of the target density represents the computational bottleneck.

Quantum computers use quantum mechanics to store information and manipulate data. By leveraging the idiosyncracies of quantum physics, these computers are able to deliver speedups over conventional computing for a relatively small number of computational problems. Some of these speedups are very large. Quantum computers can achieve exponential speedups over conventional computers for some computational tasks.. Shor’s quantum algorithm for factoring an integer $N$ (Shor, 1994) is polynomial in $\log N$ compared to a super-polynomial classical optimum. The HHL algorithm for solving sparse $N$ -dimensional linear systems (Harrow et al., 2009) is ${\cal O}\left(\log N\right)$ compared to a classical ${\cal O}\left(N\right)$ . Other algorithms deliver a still impressive polynomial speedup. For example, the algorithms considered in the following (Section 2.2) achieve quadratic speedups over conventional computing, turning ${\cal O}\left(N\right)$ runtimes into $\mathcal{O}(\sqrt{N})$ . Both the magnitude of quantum computing’s successes and the relative rarity of those successes mean that there is a general interest in leveraging quantum algorithms for previously unconsidered computational problems. We will show that—with the help of the Gumbel-max trick (Section B)—established quantum optimization techniques are actually useful for sampling from computationally expensive discrete distributions and apply this insight to achieving quadratic speedups for parallel MCMC.

We provide a short introduction to the most basic elements of quantum computing in Appendix A. The interested reader may look to Nielsen and Chuang (2002) for a canonical introduction to quantum algorithms or Lopatnikova and Tran (2021); Wang and Liu (2022) for surveys geared toward statisticians. In Section 2.1, we review and compare three different quantum search algorithms that enjoy quadratic speedups over conventional algorithms. In Section 2.2 we review the quantum minimization algorithm of Durr and Hoyer (1996) and investigate the use of warm-starting therein. We then combine the Gumbel-max trick (Section B) and recent insights in parallel MCMC with the quantum minimization algorithm to create the quantum parallel MCMC algorithm for both continuously-valued (Section 3.2) and discrete-valued (Section 3.3) targets.

2 Quantum search and quantum minimization

Grover (1996) demonstrates how use a quantum computer to find a single marked element within a finite set of $N$ items with only $\mathcal{O}(\sqrt{N})$ queries, and a result from Bennett et al. (1997) shows that Grover’s algorithm is optimal up to a multiplicative constant. Boyer et al. (1998) extend Grover’s algorithm to multiple marked items or solutions, further extend it to the case when the number of solutions is unknown and provide a rigorous bounds on the algorithms’ error rates. Finally, Durr and Hoyer (1996) use the results of Boyer et al. (1998) to obtain the minimum value within a discrete ordered set. Here, we briefly review these advances and provide illustrative simulations. In Section B, the Gumbel-max trick allows us to extend these results to the problem of sampling from a discrete distribution.

2.1 Quantum search

Data: An oracle gate

\mathbf{U}_{f}

taking

\ket{x}\ket{y}

\ket{x}\ket{y\oplus f(x)}

for a function

f(x):\{0,\dots,N-1\}\rightarrow\{0,1\}

that satisfies

f(x_{0})=1

for a single

x_{0}

;

n+1=\log_{2}(N)+1

quantum states initialized to

\ket{0}^{\otimes n}\ket{1}

; an integer

R=\lceil\pi\sqrt{N}/4\rceil

denoting the number of Grover iterations.

Result: An

n

-bit binary string

x_{0}

satisfying

f(x_{0})=1

with error less than

1/N

\ket{0}^{\otimes n}\ket{1}\longrightarrow\mathbf{H}^{\otimes n+1}\ket{0}^{\otimes n}\ket{1}=\ket{h}\ket{-}

; /*

\ket{h}=\frac{1}{\sqrt{N}}\sum_{x=0}^{N-1}\ket{x}

. */

\ket{h}\ket{-}\longrightarrow\mathbf{G}^{R}\ket{h}\ket{-}\approx\ket{x_{0}}\ket{-}

; /*

\mathbf{G}=\Big{(}2\ket{h}\bra{h}-\mathbf{I}\Big{)}\Big{(}\mathbf{I}-2\ket{x_{0}}\bra{x_{0}}\Big{)}

\ket{x_{0}}\longrightarrow x_{0}

;

return $x_{0}$ .

Algorithm 2 Quantum search algorithm (Grover, 1996)

Refer to caption — Figure 1: Curves depict the probability that Grover’s quantum search algorithm returns a solution as a function of the number of oracle evaluations. For the ranges depicted, the number of iterations required and the number of solutions inversely relate, but increasing the number of iterations past R (Equation 4) can backfire and lead to extremely small probabilities of success.

Given a set of $N$ items and a function $f:\{0,1,\dots,N-1\}\rightarrow\{0,1\}$ that evaluates to $1$ for a single element, Grover (1996) develops an algorithm that uses quantum parallelism to score quadratic speedups over its classical counterparts. After only $\mathcal{O}(\sqrt{N})$ evaluations of $f(\cdot)$ , Grover’s algorithm returns the $x\in\{0,1,\dots,N-1\}$ satisfying $f(x)=1$ with high probability. Compare this to ${\cal O}\left(N\right)$ requirement for the randomized classical algorithm that must evaluate $f(\cdot)$ over at least $N/2$ items to obtain the same probability of detecting $x$ . The algorithm takes the state $\ket{0}^{\otimes N}\ket{1}$ as input and applies Hadamard gates to each of the individual $N+1$ input qubits. The resulting state is

\displaystyle\ket{h}\ket{-}=\left(\frac{1}{\sqrt{N}}\sum_{x=0}^{N-1}\ket{x}\right)\frac{1}{\sqrt{2}}\left(\ket{0}-\ket{1}\right)=\frac{1}{\sqrt{N}}\sum_{x=0}^{N-1}\ket{x}\ket{-}\,.

Next, we apply the oracle gate $\mathbf{U}_{f}:\ket{x}\ket{y}\rightarrow\ket{x}\ket{y\oplus f(x)}$ and note that

	$\displaystyle\mathbf{U}_{f}\ket{x}\ket{-}$	$\displaystyle=\mathbf{U}_{f}\ket{x}\frac{1}{\sqrt{2}}\left(\ket{0}-\ket{1}\right)=\frac{1}{\sqrt{2}}\big{(}\ket{x}\ket{0\oplus f(x)}-\ket{x}\ket{1\oplus f(x)}\big{)}$
		$\displaystyle=-1^{f(x)}\ket{x}\ket{-}\,.$

Thus, $\mathbf{U}_{f}$ flips the sign for the state $\ket{x_{0}}$ for which $f(x_{0})=1$ but leaves the other states unchanged. If we suppress the ancillary qubit $\ket{-}$ , then $\mathbf{U}_{f}$ is equivalent to the gate $\mathbf{U}_{x_{0}}$ defined as $\mathbf{U}_{x_{0}}\ket{x}=-1^{\delta_{x_{0}}}\ket{x}$ . We may succinctly write this gate as the Householder matrix that reflects vectors about the unique hyperplane through the origin that has $\ket{x_{0}}$ for a normal vector:

\displaystyle\mathbf{U}_{x_{0}}=\mathbf{I}-2\ket{x_{0}}\bra{x_{0}}\,.

The action of this gate on the state $\ket{h}$ takes the form

\displaystyle\frac{1}{\sqrt{N}}\sum_{x=0}^{N-1}\ket{x}\longrightarrow\frac{1}{\sqrt{N}}\sum_{x=0}^{N-1}-1^{\delta_{x_{0}}(x)}\ket{x}\,.

Next, the algorithm reflects the current state about the hyperplane that has $\ket{h}$ as a normal vector and negates the resulting state:

	$\displaystyle\Big{(}2\ket{h}\bra{h}-\mathbf{I}\Big{)}\left(\frac{1}{\sqrt{N}}\sum_{x=0}^{N-1}-1^{\delta_{x_{0}}(x)}\ket{x}\right)$	$\displaystyle=\frac{1}{\sqrt{N}}\sum_{x}\left((-1)^{1-\delta_{x_{0}}(x)}+2\frac{(N-2)}{N}\right)\ket{x}$
		$\displaystyle=\frac{(3N-4)}{N^{3/2}}\ket{x_{0}}+\sum_{x\neq x_{0}}\frac{(N-4)}{N^{3/2}}\ket{x}\,.$

The scientist who measures the state at this moment would obtain the desired state $\ket{x_{0}}$ with a slightly higher probability of $(3N-4)^{2}/N^{3}$ than the individual probabilities of $(N-4)^{2}/N^{3}$ for the other states. Each additional application of the Grover iteration

\displaystyle\mathbf{G}:=-\mathbf{U}_{h}\mathbf{U}_{x_{0}}:=\Big{(}2\ket{h}\bra{h}-\mathbf{I}\Big{)}\Big{(}\mathbf{I}-2\ket{x_{0}}\bra{x_{0}}\Big{)}

(3)

increases the probability of obtaining $\ket{x_{0}}$ at the time of measurement in the computational basis and $R=\lceil\pi\sqrt{N}/4\rceil$ iterations guarantees a probability of success that is greater than $1-1/N$ .

In general, we may use Grover’s search algorithm to find a solution when the number of solutions $M$ is greater than 1. While the algorithmic details change little, the number of required Grover iterations

\displaystyle R=\Bigg{\lceil}\frac{\pi}{4}\sqrt{\frac{N}{M}}\Bigg{\rceil}

(4)

and the probability of success after those $R$ iterations do change (Boyer et al., 1998). When $M$ is much smaller than $N$ , the success rate is greater than $1-M/N$ , and even for large $M$ the success rate is more than $1/2$ . Lower-bounds are useful for establishing mathematical guarantees, but it is also helpful to understand the quality of algorithmic performance as a function of $M$ and $R$ . Figure 1 shows success curves as a function of the number of Grover iterations (or oracle calls) applied. The search field contains $2^{14}\approx 16$ k elements, and the number of solutions $M$ varies. Each curve represents an individual search task. The algorithm requires more iterations for smaller numbers and fewer iterations for larger numbers of solutions. The upper bound on error $M/N$ is only an upper bound: the final probability of success for, e.g., $M=256$ is 0.997 compared to the bound of 0.984.

Data: An oracle gate

\mathbf{U}_{f}

taking

\ket{x}\ket{y}

\ket{x}\ket{y\oplus f(x)}

for a function

f(x):\{0,\dots,N-1\}\rightarrow\{0,1\}

with unknown number of solutions;

n=\log_{2}(N)

Result: If a solution exists, an

n

-bit binary string

x_{0}

satisfying

f(x_{0})=1

; if no solution exists, the algorithm runs forever.

m\leftarrow 1

;

\gamma\leftarrow 6/5

;

success

\leftarrow

FALSE;

while success $\neq$ TRUE do

j\leftarrow Uniform\{0,\cdots,m-1\}

;

\ket{0}^{\otimes n}\ket{1}\longrightarrow\mathbf{H}^{\otimes n+1}\ket{0}^{\otimes n}\ket{1}=\ket{h}\ket{-}

;

\ket{h}\ket{-}\longrightarrow\mathbf{G}^{j}\ket{h}\ket{-}=\ket{x}\ket{-}

; /*

j

Grover iterations from Alg 2. */

\ket{x}\longrightarrow x

; /* Measure and check. */

if $f(x)=1$ then

x_{0}\leftarrow x

;

success

\leftarrow

TRUE;

else

m\leftarrow\min\left(\gamma m,\sqrt{N}\right)

; /* Increase

m

in case of failure. */

end if

end while

return $x_{0}$ .

Algorithm 3 Quantum exponential searching algorithm (Boyer et al., 1998)

While Grover’s algorithm delivers impressive speedups over classical search algorithms, it has a major weakness. Figure 1 hides the fact that the probability of success for Grover’s algorithm is not monatonic in the number of iterations. Running the algorithm for more than $R$ iterations can backfire. For example, running the algorithm for $\sqrt{2N}$ iterations when $M=1$ results in a probability of success less than $0.095$ (Boyer et al., 1998). The non-monotinicity of Grover’s algorithm becomes particularly problematic when we do not know the number of solutions $M$ . Taking for example $N=2^{20}$ , Boyer et al. (1998) point out that 804 iterations provide an extremely high probability of success when $M=1$ but a one-in-a-million chance of success when $M=4$ . To solve this problem and develop an effective search algorithm when $M$ is unknown, those authors adopt the strategy of focusing on the expected number of iterations before success. In particular, they propose the quantum exponential search algorithm (Algorithm 3). When a solution exists, the algorithm returns a solution with expected total number of Grover iterations bounded above by $\frac{9}{2}\sqrt{N/M}$ . Still better, this upper bound reduces to $\frac{9}{4}\sqrt{N/M}$ for the special case $M\ll N$ , and simulations presented in Figure 2 show that even this bound is large. Such results come in handy when deciding whether to halt the algorithm’s progress if one believes it possible that no solutions exist. Indeed, this turns out to be useful in the context of quantum minimization (Section 2.2).

Algorithm 3 is not the only quantum search algorithm that is useful when the number of solutions $M$ is unknown. Yoder et al. (2014) use generalized Grover iterations that extend (3) to include general phases $(\alpha_{j},\beta_{j})$ :

\displaystyle\mathbf{G}(\alpha_{j},\beta_{j}):=-\mathbf{U}_{h}(\alpha_{j})\mathbf{U}_{x_{0}}(\beta_{j}):=\Big{(}(1-e^{-i\alpha_{j}})\ket{h}\bra{h}-\mathbf{I}\Big{)}\Big{(}\mathbf{I}-(1-e^{i\beta_{j}})\ket{x_{0}}\bra{x_{0}}\Big{)}\,.

(5)

This form reduces to the original Grover iteration (3) when $\alpha=\pm\pi$ and $\beta=\pm\pi$ . To select general phases, the method first fixes an error tolerance $\delta^{2}>0$ and a lower bound $w\leq\lambda=M/N$ and then selects an odd upper bound $L$ on the total number of oracle queries ( $L-1$ ) such that

\displaystyle L\geq\frac{\log(2/\delta)}{\sqrt{w}}\,.

Finally, letting $l=(L-1)/2$ , one obtains phases $(\alpha_{j},\beta_{j})$ for $j=1,\dots,l$ that satisfy

\displaystyle\alpha_{j}=-\beta_{l-j+1}=2\cot^{-1}\left(\tan(2\pi j/L)\sqrt{1-\gamma^{2}}\right)\,,

for $\gamma^{-1}=\cos(\frac{1}{L}\cos^{-1}(\frac{1}{\delta}))$ is the reciprocal of the inverse $L$ th Chebyshev polynomial of the first kind. Such phases mark the only algorithmic difference between the fixed-point quantum search and the original Grover search (Algorithm 2). The upshot is a search procedure that obtains a guaranteed probability of success $1-\delta^{2}$ for any $M$ such that there exists an $M_{0}<M$ that also obtains the same success threshold (Figure 2). On the one hand, this algorithm avoids the exponential quantum search algorithm’s need to perform multiple measurements of the quantum state. On the other hand, the exponential quantum search algorithm requires significantly fewer oracle evaluations when $M$ is small (Figure 3), and it turns out that this is precisely the scenario that interests us.

2.2 Quantum minimization

Data: A quantum sub-routine capable of evaluating a function

f(\cdot)

over

\{0,\dots,N-1\}

with with unique integer values; a maximum error tolerance

\epsilon\in(0,1)

; expected total time to success

m_{0}=\frac{45}{4}\sqrt{N}+\frac{7}{10}\log_{2}(N)

Result: A

\log_{2}(N)

-bit binary string

x_{0}

satisfying

f(x_{0})=\min f

with probability greater than

1-\epsilon

s\leftarrow 0

;

x_{0}\leftarrow Uniform\{0,\dots,N-1\}

;

while $s<m_{0}/\epsilon$ do

Prepare initial state

\frac{1}{\sqrt{N}}\sum_{x}\ket{x}\ket{x_{0}}

;

Mark all items

x

satisfying

f(x)<f(x_{0})

;

s\leftarrow s+\log_{2}(N)

;

Apply quantum exponential searching algorithm (Alg 3); /*

I

time steps */

s\leftarrow s+I

;

Obtain

x^{\prime}

by measuring first register;

if $f(x^{\prime})<f(x_{0})$ then

x_{0}\leftarrow x^{\prime}

end if

end while

return $x_{0}$ .

Algorithm 4 Quantum minimum searching algorithm (Durr and Hoyer, 1996)

Given a function $f(\cdot)$ that maps the discrete set $\{0,\dots,N-1\}$ to the integers, we wish to find the minimizer

\displaystyle x_{0}=\operatornamewithlimits{arg\,min}_{x\in\{0,\dots,N-1\}}f(x)\,.

Durr and Hoyer (1996) propose a quantum algorithm for finding $x_{0}$ that iterates between updating a running minimum $F\in f(\{0,\dots,N-1\})$ and applying the quantum exponential search algorithm (Algorithm 3) to find an element $x$ such that $f(x)<F$ . Having run these iterations a set number of times, the algorithm returns the minimizer $x_{0}$ with high probability. To this end, Durr and Hoyer (1996) show that their algorithm takes an expected total time less than or equal to $m_{0}:=\frac{45}{4}\sqrt{N}+\frac{7}{10}\log_{2}(N)$ to find the minimizer, where marking items with values less than the threshold value (Algorithm 4) requires $\log_{2}(N)$ time steps and each Grover iteration within the quantum exponential search algorithm requires one time step. From there, Markov’s inequality says

\displaystyle\mbox{Pr}\left(\mbox{total time to success}\geq\frac{m_{0}}{\epsilon}\right)\leq\frac{\mbox{E}\left(\mbox{total time to success}\right)}{m_{0}/\epsilon}\leq\epsilon\,,

or that we must scale the minimization procedure’s time steps by a factor of $1/\epsilon$ to reach a failure rate less than $\epsilon$ . Due to the iterative nature of the algorithm, one might suppose that it is beneficial to start at a value $x_{0}$ for which $f(x_{0})$ is lower relative to other values. This turns out to be the case in theory and in practice, and the benefit of warm-starting is particularly useful in the context of parallel MCMC.

Proposition 1 (Warm-starting).

Suppose that the quantum minimization algorithm begins with a threshold $F_{0}$ such that $f(x)<F_{0}$ for only $K-1>0$ items. Then the expected total number of time steps to find the minimizer is bounded above by

\displaystyle m_{0}^{K}=\left(\frac{5}{4}-\frac{1}{\sqrt{K-1}}\right)9\sqrt{N}+\frac{7}{10}\log_{2}(K)\log_{2}(N)\,,

and so the following rule relates the warm-started upper bound to the generic upper bound:

\displaystyle m_{0}^{K}=m_{0}-9\sqrt{\frac{N}{K-1}}+\frac{7}{10}\log_{2}\left(\frac{K}{N}\right)\log_{2}(N)\,.

Proof.

The proof follows the exact same form as Lemma 2 of Durr and Hoyer (1996). It relies on a theoretical algorithm called the infinite algorithm that runs until the minimum is found. In this case, Lemma 1 of that paper says that the probability that the $r$ th lowest value is ever selected when searching among $K$ items is $p(K,r)=1/r$ for $r\leq K$ and 0 otherwise. For a warm-start at element $K$ , the expected total time spent in the exponential search algorithm is

	$\displaystyle\sum_{r=2}^{N}p(K,r)\frac{9}{2}\sqrt{\frac{N}{r-1}}$	$\displaystyle=\sum_{r=2}^{K}p(K,r)\frac{9}{2}\sqrt{\frac{N}{r-1}}=\frac{9}{2}\sqrt{N}\sum_{r=1}^{K-1}\frac{1}{r+1}\frac{1}{\sqrt{r}}$
		$\displaystyle\leq\frac{9}{2}\sqrt{N}\left(\frac{1}{2}+\sum_{r=2}^{K-1}r^{-3/2}\right)\leq\frac{9}{2}\sqrt{N}\left(\frac{1}{2}+\int_{r=1}^{K-1}r^{-3/2}\mbox{d}r\right)$
		$\displaystyle=\left(\frac{5}{4}-\frac{1}{\sqrt{K-1}}\right)9\sqrt{N}\,.$

An upper bound for the expected total number of time steps preparing the initial state and marking items $f(x)<f(x_{0})$ follows in a similar manner. ∎

Proposition 1 shows that, e.g., if Algorithm 4 begins at the item with second lowest value, then the expected total time to success is bounded above by $m_{0}^{2}=\frac{9}{4}\sqrt{N}+\frac{7}{10}\log_{2}(N)$ , reducing the generic upper bound by $9\sqrt{N}-\frac{7}{10}\log_{2}\left(\frac{2}{N}\right)\log_{2}(N)$ time steps. When $N$ equals 1000, say, the expected total time steps is $m_{0}^{2}<78.2$ . Keeping $N=1000$ but letting the algorithm begin at the third lowest value ( $K=3$ ), the expected total time steps before success is $m_{0}^{3}<165.6$ . Raising $N$ to 10,000, the two numbers increase to $m_{0}^{2}<234.4$ and $m_{0}^{3}<503.4$ .

In practice, the number of time steps before finding the minimum is surprisingly small. Figure 4 corroborates the intuition that it is beneficial to start at a lower ranked element. To generate these results, we use an early stopping rule for the quantum minimization algorithm’s exponential searching sub-routine, halting it after $\frac{9}{4}\sqrt{N}$ iterations. Even with this stopping rule in place, the error rate of the quantum minimization algorithm is less than $1\%$ . We can increase the early stopping threshold to obtain even lower error rates, but this strategy would seem to be unnecessary in the context of parallel MCMC. The benefits of warm-starting are useful in the same context, when the current state $\bm{\theta}^{(s)}$ usually inhabits the high-density region of the target distribution but the majority of proposals do not.

3 Quantum parallel MCMC

With the rudiments of quantum minimization in hand, we present a quantum parallel MCMC (QPMCMC). The general structure of the algorithm the same as Algorithm 1: it constructs a Markov chain by iterating between generating many proposals and randomly selecting new Markov chain states among these proposals. Unlike classical single-proposal MCMC, parallel MCMC proposes many states and chooses one according to the generalized acceptance probabilities of (2). In general MCMC, evaluation of the target density function $\pi(\cdot)$ is the rate-limiting computational step, becoming particularly onerous in big data scenarios (Holbrook et al., 2021a). While parallel MCMC benefits from improved mixing as a function of MCMC iterations, the increased computational burden of $P$ target evaluations at each step can lead to less favorable comparisons when accounting for elapsed wall-clock time.

Having successfully generated proposals $\bm{\theta}_{p},$ $p=1,\dots,P$ and evaluated the corresponding proposal densities $q(\cdot,\cdot)$ , we would like to use quantum parallelism and an appropriate oracle gate to efficiently compute the $\pi(\bm{\theta}_{p})$ s but immediately encounter a Catch-22 when we seek to draw a sample $\bm{\theta}^{(s+1)}\sim$ Discrete $(\bm{\pi})$ for $\bm{\pi}$ the vector of probabilities $\pi_{0},\pi_{1},\dots,\pi_{P}$ defined in (2). We can use neither a quantum nor a classical circuit to draw the sample! On the one hand, drawing the sample within a quantum circuit would somehow require that all superposed states have access to all the $\pi(\bm{\theta}_{p})$ s at once to perform the required normalization. On the other hand, drawing the sample within the classical circuit would require access to all the $\pi(\bm{\theta}_{p})$ s, but measurement in the computational basis can only return one.

In light of this dilemma, we propose to use the Gumbel-max trick (Papandreou and Yuille, 2011) to transform the generalized accept-reject step into a discrete optimization procedure (Section B). From there, we use quantum minimization to efficiently sample from $Discrete(\bm{\pi})$ . Crucially, we get away with quantum parallel evaluation of the target $\pi(\cdot)$ over all superposed states because the Gumbel-max trick does not rely on a normalization step: each superposed state requires no knowledge of any target evaluation other than its own.

Once one has effectively parallelized the target evaluations $\pi(\bm{\theta}_{p})$ , the new computational bottleneck is the calculations required to obtain the $P+1$ proposal density terms $q(\bm{\theta}_{p},\bm{\Theta}_{-p})$ , for $p=0,\dots,P$ . Under an independent proposal mechanism with a proposal distribution of the form $q(\bm{\theta}_{0},\bm{\Theta}_{-0})=\prod_{p=1}^{P}q(\bm{\theta}_{0},\bm{\theta}_{p})$ , the acceptance probabilities (2) require the ${\cal O}\left(P^{2}\right)$ evaluation of the $P+1\choose 2$ terms $q(\bm{\theta}_{p},\bm{\theta}_{p^{\prime}})$ . To avoid such calculations, we follow Holbrook (2021) and use symmetric proposal distributions for which $q(\bm{\theta}_{p},\bm{\Theta}_{-p})=q(\bm{\theta}_{p^{\prime}},\bm{\Theta}_{-p^{\prime}})$ for $p,p^{\prime}=0,\dots,P$ .

Data: A vector of unnormalized log-probabilities

\bm{\lambda}^{*}=\log\bm{\pi}+\log c

, for

\bm{\pi}

a discrete probability vector with

P+1

elements.

Result: A single sample

\hat{p}\sim

Discrete

(\bm{\pi})

satisfying

\hat{p}\in\{0,1,\dots,P\}

for $p\in\{0,1,\dots,P\}$ do

z_{p}\leftarrow Gumbel(0,1)

;

\alpha^{*}_{p}\leftarrow\lambda_{p}^{*}+z_{p}

;

end for

\hat{p}\leftarrow\operatornamewithlimits{arg\,max}_{p=0,\dots,P}\alpha^{*}_{p}

;

return $\hat{p}\,$ .

Algorithm 5 The Gumbel-max trick

3.1 Simplified acceptance probabilities

Evaluation of the $P+1$ proposal density terms $q(\bm{\theta}_{p},\bm{\Theta}_{-p})$ can be computationally burdensome as $P$ grows large. Holbrook (2021) proves that two specific proposal mechanisms enforce equality across all $P+1$ terms and thus enable the simplified acceptance probabilities

\displaystyle\pi_{p}=\frac{\pi(\bm{\theta}_{p})}{\sum_{p^{\prime}=0}^{P}\pi(\bm{\theta}_{p^{\prime}})}\,,\quad p=0,\dots,P\,.

(6)

In Section 3.2, we opt for one of these proposal mechanisms, the centered Gaussian proposal of Tjelmeland (2004). This strategy first generates a Gaussian random variable $\bar{\bm{\theta}}$ centered at the current state $\bm{\theta}_{0}$ and then generates $P$ proposals centered at $\bar{\bm{\theta}}$ :

\displaystyle\bm{\theta}_{1},\dots,\bm{\theta}_{P}\stackrel{{\scriptstyle\perp}}{{\sim}}Normal_{D}(\bar{\bm{\theta}},\bm{\Sigma})\,,\quad\bar{\bm{\theta}}\sim Normal_{D}(\bm{\theta}_{0},\bm{\Sigma})\,.

(7)

Holbrook (2021) shows that this strategy leads to the higher-order proposal symmetry

\displaystyle q(\bm{\theta}_{0},\bm{\Theta}_{-0})=q(\bm{\theta}_{1},\bm{\Theta}_{-1})=\dots=q(\bm{\theta}_{P},\bm{\Theta}_{-P})

(8)

and that the simplified acceptance probabilities (6) maintain detailed balance. In fact, other location scale families also suffice in the manner, but here we focus on Gaussian proposals without loss of generality. Finally, the simplicial sampler (Holbrook, 2021) accomplishes the same simplified acceptance probabilities but incurs an $\mathcal{O}(D^{3})$ computational cost.

One may also use a more limited strategy to enforce (8). Glatt-Holtz et al. (2022) show that the sampler using independence proposal $q(\bm{\theta}_{p},\bm{\Theta}_{-p})=\prod_{p^{\prime}\neq p}q(\bm{\theta}_{p^{\prime}})$ , where $q(\cdot)$ is not a function of $\bm{\theta}_{p}$ , results in acceptance probabilities

\displaystyle\pi_{p}=\frac{\pi(\bm{\theta}_{p})/q(\bm{\theta}_{p})}{\sum_{p^{\prime}=0}^{P}\pi(\bm{\theta}_{p^{\prime}})/q(\bm{\theta}_{p^{\prime}})}\,,\quad p=0,\dots,P\,.

(9)

When $\pi(\cdot)$ and $q(\cdot)$ take continuous values on a topologically compact domain or discrete values on a finite domain, one may specify $q(\cdot)$ to be uniform and let (9) simplify to (6). This strategy proves useful in Section 3.3, in which we consider discrete-valued targets over Ising-type lattice models.

3.2 Continuously-valued targets

Data: Initial Markov chain state

\bm{\theta}^{(0)}

; total length of Markov chain

S

; total number of proposals per iteration

P

; routine for evaluating target density

\pi(\cdot)

; routines for drawing random samples from a

D

-dimensional multivariate Gaussian

Normal_{D}(\bm{\mu},\bm{\Sigma})

and the standard Gumbel distribution

Gumbel(0,1)

Result: A Markov chain

\bm{\theta}^{(1)},\dots,\bm{\theta}^{(S)}

for $s\in\{1,\dots,S\}$ do

\bm{\theta}_{0}\leftarrow\bm{\theta}^{(s-1)}

;

z_{0}\leftarrow Gumbel(0,1)

;

\bar{\bm{\theta}}\leftarrow Normal_{D}(\bm{\theta}_{0},\bm{\Sigma})

; /* Proposal (7) */

for $p\in\{1,\dots,P\}$ do

\bm{\theta}_{p}\leftarrow Normal_{D}(\bar{\bm{\theta}},\bm{\Sigma})

;

z_{p}\leftarrow Gumbel(0,1)

;

\ket{\theta_{p}}\leftarrow\bm{\theta}_{p}

; /* Load proposal onto quantum computer. */

end for

\hat{p}\leftarrow\operatornamewithlimits{arg\,min}_{p=0,\dots,P}\Big{(}f(p):=-\big{(}z_{p}+\log\pi(\bm{\theta}_{p})\big{)}\Big{)}

;

{\cal O}\left(\sqrt{P}\right)

Alg 4 with warm-start at

p=0

. */

\bm{\theta}^{(s)}\leftarrow\bm{\theta}_{\hat{p}}

;

end for

return $\bm{\theta}^{(1)},\dots,\bm{\theta}^{(S)}$ .

Algorithm 6 Quantum parallel MCMC for a continuously-valued target

Algorithm 6 presents the details of QPMCMC for a continuously-valued target. The algorithm uses conventional (perhaps parallel) computing almost entirely, excluding two lines that are highlighted. The first of these loads proposals onto the quantum machine, and the second line that calls the quantum minimization algorithm presented in Algorithm 4. This sub-routine seeks to find the minimizer of the function

\displaystyle f(p)=-\big{(}z_{p}+\log\pi(\bm{\theta}_{p})\big{)}\quad\mbox{over}\quad p=0,1,\dots,P,

(10)

which combines the numerators of (2), the Gumbel-max trick and a trivial negation. As discussed in Section 2.2, the sub-routine requires only $\mathcal{O}(\sqrt{P})$ oracle evaluations and, therefore, only $\mathcal{O}(\sqrt{P})$ evaluations of the target density $\pi(\cdot)$ using a quantum circuit. In theory, a quantum computer can perform the same computations as a classical computer, but the efficiency of the target evaluations would depend on a number of factors related to the structure of the density itself, the availability of fast quantum random access memory (Giovannetti et al., 2008) and the ability of large numbers of quantum gates to act in concert with negligible noise (Kielpinski et al., 2002; Erhard et al., 2019).

In general MCMC, one often calibrates the scaling of a proposal kernel to balance between exploring the target’s domain and remaining within high-density regions. Optimal scaling strategies may lead to a large number of rejected proposals (Roberts and Rosenthal, 2001). Indeed, Holbrook (2021) shows that parallel MCMC algorithms are no exception. When using the Gumbel-max trick to sample from proposals, this means that the current state is often quite near optimality, representing a warm-start. Figure 5 shows how this warm-starting effects the number of oracle calls (and hence target evaluations) within the quantum minimization sub-routine over the course of an MCMC trajectory. We target multi-dimensional standard normal distributions of differing dimensions using the vector $(100,\dots,100)$ as starting position. We fix the number of Gaussian proposals to be 2,000 and use standard adaptation procedures (Rosenthal, 2011) to target a 50% acceptance rate while guaranteeing convergence. Although no theoretical results validate this target acceptance rate, the rate is close to empirically optimal rates from Holbrook (2021). Across iterations, QPMCMC requires relatively few oracle calls compared to the 2,000 target evaluations of parallel MCMC. We also witness the expected drop in the number of oracle evaluations as the chain approaches the target’s high-density region. Remarkably, the algorithm succeeds in sampling from the true discrete distribution in 99.5% of the MCMC iterations. An independent simulation obtains similar results, requiring roughly 7% of the usual number of target evaluations. Figure 6 shows a quantile-quantile plot for all 100 dimensions of a multi-dimensional standard normal distribution. We see no appreciable impact from the rare failure of the quantum minimization sub-routine.

Finally, we compare convergence speed for QPMCMC using 1,000, 5,000 and 10,000 proposals to target a massively multimodal distribution: a mixture of 1,000 2-dimensional, isotropic Gaussians with standard deviations equal to 1 and means equal to $10\times(\bm{0},\bm{1},\dots,\bm{999})$ . This target is particularly challenging because the distance between modes is significantly larger than standard deviations. In 5 independent simulations, we run each algorithm until it achieves and effective sample size of 100. Table 1 contains MCMC iterations and target evaluations required to achieve this effective sample size as well as relative speedups over sequential implementations and relative improvements over the 1,000 proposal sampler.

Proposals	MCMC iterations	Target evaluations	Speedup	Efficiency gain
1,000	249,398 (200,998, 311,998)	24,988,963 (20,149,132, 31,265,011)	9.98 (9.98, 9.98)	1
5,000	14,398 (12,998, 16,998)	3,314,560 (2,989,418, 3,916,281)	21.72 (21.70, 21.74)	7.58 (6.25, 9.71)
10,000	5,998 (4,998, 6,998)	1,993,484 (1,662,592, 2,330,842)	30 (29.96, 30.26)	12.87 (8.64, 18.80)

Table 1: Racing to a minimum (between the two dimensions) of 100 effective samples for a target with 1,000 disjoint modes. ‘Speedup’ is ratio between target evaluations required for sequential and quantum implementations. ‘Efficiency gain’ is ratio between target evaluations required for 1,000 proposal and 5,000/10,000 proposal implementations. We report means (minima, maxima) across 5 independent runs.

3.3 Discrete-valued targets

We wish to sample from a target distribution defined over a discrete set $\{\bm{\theta}_{\alpha}\}_{\alpha\in\mathcal{A}}$ , for $\mathcal{A}$ some finite or countably infinite set of indices. Glatt-Holtz et al. (2022) establish the broader measure theoretic foundations for parallel MCMC procedures that use selection probabilities (2) to maintain detailed balance. In particular, when the target distribution has probability mass function $\pi(\cdot)$ defined with respect to the counting measure on the power set $2^{\mathcal{A}}$ , then detailed balance results in the kernel $P(\cdot,\cdot)$ satisfying

\displaystyle\pi(\alpha)=\sum_{{\alpha^{\prime}}}\pi(\alpha^{\prime})P(\alpha^{\prime},\alpha)\,,\quad\forall\alpha,\alpha^{\prime}\in\mathcal{A}\,.

Sections 3.3.1 and 3.3.2 consider targets defined over finite sets $\mathcal{A}$ and make use of the uniform independence proposal scheme described in Section 3.1, thereby facilitating simplified acceptance probabilities (6). This scheme proposes single-flip updates to the current Markov chain state $\bm{\theta}_{0}$ , but it is worth noting that other multiple-flip schemes would also be amenable to QPMCMC.

3.3.1 Ising model on a 2-dimensional lattice

Here, we are interested in an Ising-type lattice model over configurations $\bm{\theta}=(\theta_{1},\dots,\theta_{D})$ consisting of $D$ individual spins $\theta_{d}\in\{-1,1\}$ . In terms of the preceding section, we have $\mathcal{A}=\{-1,1\}^{D}$ and $|\mathcal{A}|=2^{D}$ . In particular, we consider targets of the form

\displaystyle\pi(\bm{\theta}|\rho)

\displaystyle\propto\exp\left(\rho\sum_{(d,d^{\prime})\in\mathcal{E}}\theta_{d}\theta_{d^{\prime}}\right)\,,

(11)

where $\rho>0$ is the interaction term and $\mathcal{E}$ is the lattice edge set. We specify a two-dimensional 500-by-500 lattice with nearest neighbors connections and fix $\rho=1$ . Since this latter setting encourages neighboring spins to equal each other, we begin our QPMCMC trajectories at the lowest-probability ‘checkerboard’ state for an initial configuration. At each iteration, we generate collections of proposal states by uniformly flipping $P$ individual spins $\theta_{p}$ , $p\in\{1,\dots,P\}$ , and obtaining each proposal state $\bm{\theta}_{p}$ by updating the current state $\bm{\theta}_{0}$ at the single location corresponding to $\theta_{p}$ . Figure 7 shows results from 10 independent runs using $P\in\{4,8,16,\dots,2048\}$ proposals. Trace plots of unnormalized log-probabilities indicate that higher proposal counts enable discovery of higher probability configurations. Interestingly, QPMCMC appears to be particularly beneficial in this large $P$ context, requiring less than one-tenth the target evaluations required using conventional computing when $P=2048$ .

3.3.2 Bayesian image analysis

We apply a Bayesian image classification model to an intensity map (Figure 8) of the newly imaged supermassive black hole, Sagittarius A*, located at the Galactic Center of the Milky Way (Akiyama et al., 2022). Whereas one cannot see the black hole itself, one may see the shadow of the black hole cast by the hot, swirling cloud of gas surrounding it. We extract the black hole from the surrounding cloud by modeling the intensity at each of the $D=$ 4,076 ${}^{2}=$ 16,613,776 pixels as belonging to a mixture of two truncated normals with values $y_{d}$ restricted to the intensity range $[0,255]$ . Namely, we follow Hurn (1997) and specify the latent variable model

	$\displaystyle y_{d}\|(\mu_{\ell},\sigma^{2},\theta_{d})$	$\displaystyle\stackrel{{\scriptstyle ind}}{{\sim}}\mbox{Normal}(\mu_{\ell},\sigma^{2})\,,\quad y_{d}\in[0,255]\,,\quad\theta_{d}=\ell\,,\quad d\in\{1,\dots,D\}\,,$
	$\displaystyle\mu_{\ell}$	$\displaystyle\stackrel{{\scriptstyle iid}}{{\sim}}\mbox{Uniform}(0,255)\,,\quad\ell\in\{-1,1\}\,,$
	$\displaystyle\frac{1}{\sigma^{2}}$	$\displaystyle\sim\mbox{Gamma}\left(\frac{1}{2},\frac{1}{2}\right)\,,$

where $\bm{\theta}=(\theta_{1},\dots,\theta_{D})$ share for a prior the Ising model (11) with edges joining neighboring pixels and interaction $\rho=1.2$ .

We use a QPMCMC-within-Gibbs scheme to infer the join posterior of $\bm{\theta}$ and the three mixture model parameters. For the former, we use the same QPMCMC scheme as in Section 3.3.1 with 1,024 proposals at each iteration. For the latter, we use the closed-form updates presented in Hurn (1997). We run this scheme for 20 million iterations, discarding the first 10 million as burnin. We thin the remaining sample at a ratio of 1 to 40,000 for the latent memberships $\bm{\theta}$ and 1 to 4,000 for the three parameters $\mu_{-1}$ , $\mu_{1}$ and $\sigma^{2}$ . Using the R package coda (Plummer et al., 2006), we calculate effective sample sizes of the log-posterior (257.1), $\mu_{-1}$ (1,578.7), $\mu_{1}$ (257.6) and $\sigma^{2}$ (2,500.0), suggesting adequate convergence. Figure 8 shows both the intensity data and the pixelwise posterior mode of the latent membership vector $\bm{\theta}$ . The QPMCMC requires only 1,977,553,608 target evaluations compared to the 1,024 $\times$ 20,000,000 $=$ 2.048 $\times 10^{10}$ evaluations required for the analogous parallel MCMC scheme implemented on a conventional computer, representing a 10.36-fold speedup.

4 Discussion

We have shown that parallel MCMC enjoys quadratic speedups by combining quantum minimization with the Gumbel-max trick. Within a QPMCMC iteration, the current state represents a warm-start for the sampling-turned-optimization problem, leading to increased efficiencies for the quantum minimization algorithm. Moreover, combining this approach with the Tjelmeland correction (Holbrook, 2021) results in a total complexity reduction from $\mathcal{O}(P^{2})$ to $\mathcal{O}(\sqrt{P})$ . Preliminary evidence suggests that our strategy may find use when target distributions exhibit extreme multimodality. While the algorithm still must construct long Markov chains to reach equilibrium, generating each individual Markov chain state requires significantly fewer target evaluations.

There are major technical barriers to the practical implementation of QPMCMC. The framework, like other quantum machine learning (QML) schemes, requires on-loading and off-loading classical data to and from a quantum machine. In the seminal review of QML, Biamonte et al. (2017) discuss what they call the ‘input problem’. Cortese and Braje (2018) present a general purpose quantum circuit for loading $B$ classical bits into a quantum data structure comprising $\log_{2}(B)$ qubits with circuit depth ${\cal O}\left(\log(B)\right)$ . For continuous targets, QPMCMC requires reading ${\cal O}\left(P\right)$ real vectors $\bm{\theta}_{p}\in\mathbb{R}^{D}$ onto the quantum machine at each MCMC iteration. If $b$ is the number of bits used to represent a single real value, then reading $B={\cal O}\left(PDb\right)$ classical bits into the quantum machine requires time ${\cal O}\left(\log(PDb)\right)$ . This is true whether one opts for fixed-point (Jordan, 2005) or floating-point (Haener et al., 2018) representations for real values. The outlook can be better for discrete distributions. For example, the QPMCMC scheme presented in Section 4 only requires loading the entire configuration state once prior to sampling. A $D$ -spin Ising model requires $D$ classical bits to encode, and these bits load onto a quantum machine in time ${\cal O}\left(\log(D)\right)$ . Once one has encoded the entire system state, each QPMCMC iteration only requires loading the addresses of the ${\cal O}\left(P\right)$ proposal states. If one uses $b$ bits to encode each address, then the total time required to load data onto the quantum machine is ${\cal O}\left(\log(Pb)\right)$ for each QPMCMC iteration. On the other hand, the speedup for discrete targets assumes the ability to hold an entire configuration in QRAM. Conveniently, the ‘output problem’ is less of an issue for QPMCMC, as only a single integer $\hat{p}\in\{0,\dots,P\}$ need be extracted within Algorithm 4.

This work leads to three interesting questions. First, what is the status of inference obtained by QPMCMC? The QPMCMC selection step relies on quantum minimization, an algorithm that only achieves success with probability $1-\epsilon$ . While empirical studies suggest that this error induces little bias, it would be helpful to use this $\epsilon$ to bound the distance between the target distribution and the distribution generated by QPMCMC. Such theoretical efforts would need to extend recent formalizations of the multiproposal based parallel MCMC paradigm (Glatt-Holtz et al., 2022) to account for biased MCMC kernels.

Second, can QPMCMC be combined with established quantum algorithms that make use of ‘quantum walks’ on graphs to sample from discrete target distributions? Szegedy (2004) presents a quantum analog to classical ergodic reversible Markov chains and shows that such quantum walks sometimes provide quadratic speedups over classical Markov chains. Szegedy (2004) also points out that Grover’s search, a key component within QPMCMC, may be interpreted as performing just such a quantum walk on a complete graph. Wocjan and Abeyesinghe (2008) accelerate the quantum walk by using ancillary Markov chains to improve mixing and apply their method to simulated annealing. Given a quantum algorithm $\mathbb{A}$ for producing a discrete random sample with variance $\sigma^{2}$ , Montanaro (2015) develops a quantum algorithm for estimating the mean of algorithm $\mathbb{A}$ ’s output with error $\epsilon$ by running algorithm $\mathbb{A}$ a mere $\widetilde{\mathcal{O}}(\sigma/\epsilon)$ times, where $\widetilde{\mathcal{O}}$ hides polylogarithmic terms. Importantly, this quadratic quantum speedup over classical Monte Carlo applies for quantum algorithms $\mathbb{A}$ that only feature a single measurement such as certain simple quantum walk algorithms. Unfortunately, this assumption fails for quantizations of Metropolis-Hastings, in general, and QPMCMC, in particular. More promising for QPMCMC, Lemieux et al. (2020) develop a quantum circuit that applies Metropolis-Hastings to the Ising model without the need for oracle calls. An interesting question is whether similar non-oracular quantum circuits exist for the basic parallel MCMC backbone to QPMCMC. In general, however, comparison between QPMCMC and other quantum Monte Carlo techniques is challenging because the foregoing literature (a) largely focuses on MCMC as a tool for discrete optimization, with algorithms that only ever return a single Monte Carlo sample or function thereof, and (b) restricts itself to a handfull of simple, stylized and discrete target distributions. On the other hand, QPMCMC is a general inferential framework for sampling from general discrete and continuous distributions alike.

Third, there are other models and algorithms in statistics, machine learning and statistical mechanics that require sampling from potentially costly discrete distributions. Can one adapt our quantum Gumbel-max trick to these? Approximate Bayesian computation (Csilléry et al., 2010) extends probabilistic inference to contexts within which one only has access to a complex, perhaps computationally intensive, forward model. Having sampled model parameters from the prior, one accepts or rejects these parameter values based on the discrepancy between the observed and simulated data. If one succeeds in embedding forward model dynamics within a quantum circuit, then one may plausibly select from many parameter values using our quantum Gumbel-max trick. The trick may also find use within sequential Monte Carlo (Doucet et al., 2001). For example, Berzuini and Gilks (2001) present a sequential importance resampling algorithm that uses MCMC-type moves to encourage particle diversity and avoid the need for bootstrap resampling. Multiproposals accelerated by the quantum Gumbel-max trick could add speed and robustness to such MCMC-resampling.

Large-scale, practical quantum computing is still a long way off, but quantum algorithms are ripe for mainstream computational statistics.

Acknowledgments

This work was supported by grants NIH K25 AI153816, NSF DMS 2152774 and NSF DMS 2236854.

Appendix A Limited introduction to quantum computing

Quantum computers perform operations on unit-length vectors of complex data called quantum bits or qubits. One may write any qubit $\bm{\psi}$ as a linear combination of the computational basis states $\ket{0}$ and $\ket{1}$ . In symbols,

\displaystyle\ket{\psi}=\alpha\ket{0}+\beta\ket{1}\quad\mbox{for}\quad\alpha,\,\beta\in\mathbb{C}\quad\mbox{and}\quad|\alpha|^{2}+|\beta|^{2}=1\,.

This formula uses Dirac or bra-ket notation and obscures ideas that are commonplace in the realm of applied statistics. We make the unit-vector specification of $\ket{\psi}$ clear by writing

\displaystyle\ket{0}=\begin{bmatrix}1\\ 0\end{bmatrix}\,,\quad\ket{1}=\begin{bmatrix}0\\ 1\end{bmatrix}\quad\mbox{or}\quad\ket{\psi}=\alpha\begin{bmatrix}1\\ 0\end{bmatrix}+\beta\begin{bmatrix}0\\ 1\end{bmatrix}\,.

The full machinery of linear algebra is also available within this notation. The conjugate transpose of $\ket{\psi}$ is $\bra{\psi}$ . The inner product of $\ket{\psi}$ and $\ket{\phi}$ is $\braket{\phi}{\psi}$ . The outer product is $\ket{\psi}\bra{\phi}$ . Naturally, we can write $\bm{\psi}$ as a linear combination of any other such basis elements. Consider instead the vectors

\displaystyle\ket{+}=\frac{1}{\sqrt{2}}\ket{0}+\frac{1}{\sqrt{2}}\ket{1}\quad\mbox{and}\quad\ket{-}=\frac{1}{\sqrt{2}}\ket{0}-\frac{1}{\sqrt{2}}\ket{1}\,.

A little algebra shows that $\ket{+}$ and $\ket{-}$ are indeed unit-length and orthogonal to each other. A little more algebra reveals that, with respect to this basis, $\bm{\psi}$ has coefficients $\alpha^{\prime}$ and $\beta^{\prime}$ given by $(\alpha+\beta)/\sqrt{2}$ and $(\alpha-\beta)/\sqrt{2}$ , respectively. A last bit of algebra shows that this representation is consistent with $\bm{\psi}$ ’s unit length.

But linear algebra is not the only aspect of quantum computing that should come easily to applied statisticians. In addition to thinking of $\bm{\psi}$ as a vector, it is also useful to think of $\bm{\psi}$ as a (discrete) probability distribution over the computational basis states $\ket{0}$ and $\ket{1}$ . The constraints on coefficients $\alpha$ and $\beta$ mean that we can think of $|\alpha|^{2}$ and $|\beta|^{2}$ as probabilities that $\bm{\psi}$ is in state $\ket{0}$ or $\ket{1}$ , respectively. Accordingly, $\ket{+}$ and $\ket{-}$ encode uniform distributions over the computational basis states. In the parlance of quantum mechanics, $\alpha$ , $\beta$ and $\pm 1/\sqrt{2}$ are all probability amplitudes, and $\bm{\psi}$ , $\ket{+}$ and $\ket{-}$ are all superpositions of the computational basis states. Quantum measurement of $\bm{\psi}$ results in $\ket{0}$ with probability $|\alpha|^{2}$ , but—in the following—we can safely think of this physical procedure as drawing a single discrete sample from $\bm{\psi}$ ’s implied probability distribution.

Quantum logic gates perform linear operations on qubits like $\bm{\psi}$ and take the form of unitary matrices $\mathbf{U}$ satisfying $\mathbf{U}^{\dagger}\mathbf{U}=\mathbf{I}$ for $\mathbf{U}^{\dagger}$ the conjugate transpose of $\mathbf{U}$ . One terrifically important single-qubit quantum gate is the Hadamard gate

\displaystyle\mathbf{H}=\frac{1}{\sqrt{2}}\left[\begin{array}[]{rr}1&1\\ 1&-1\end{array}\right]\,.

One may verify that $\mathbf{H}$ is indeed unitary and that its action maps $\ket{0}$ to $\ket{+}$ and $\ket{1}$ to $\ket{-}$ . In fact, the reverse is also true on account of the symmetry of $\mathbf{H}$ . The Hadamard gate thus takes the computational basis states in and out of superposition, facilitating a phenomenon called quantum parallelism. Given a function $f:\{0,1\}\rightarrow\{0,1\}$ , consider the two-qubit quantum oracle gate $\mathbf{U}_{f}$ which takes $\ket{x}\ket{y}$ as input and returns $\ket{x}\ket{y\oplus f(x)}$ as output, where $\oplus$ denotes addition modulo 2. Importantly, the output simplifies to $\ket{x}\ket{f(x)}$ for $y=0$ . We now consider a quantum circuit that acts on two qubits by first applying the Hadamard transform $\mathbf{H}$ to the first qubit and then applying the oracle gate $\mathbf{U}_{f}$ to both. Using the state $\ket{0}\ket{0}$ as input, we have

\displaystyle\ket{0}\ket{0}\longrightarrow\frac{1}{\sqrt{2}}\ket{0}\ket{0}+\frac{1}{\sqrt{2}}\ket{1}\ket{0}\longrightarrow\frac{1}{\sqrt{2}}\ket{0}\ket{f(0)}+\frac{1}{\sqrt{2}}\ket{1}\ket{f(1)}\,.

(12)

The quantum circuit evaluates $f(\cdot)$ simultaneously over both inputs! Unfortunately, the scientist who implements this circuit cannot directly access $\mathbf{U}_{f}$ ’s output, and measurement will provide only $\ket{0}\ket{f(0)}$ or $\ket{1}\ket{f(1)}$ with probability $1/2$ each. Unlocking the potential of quantum parallelism requires more ingenuity.

Uncountably many single- and two-qubit quantum gates exist, but the real power of quantum computing stems from the development of complex quantum gates that act on multiple qubits simultaneously. To access this power, we need one more tool that also appears in the statistics literature. The Kronecker or tensor product between an $L$ -by- $M$ matrix $\mathbf{A}$ and an $N$ -by- $O$ matrix $\mathbf{B}$ is the $LN$ -by- $MO$ matrix

\displaystyle\mathbf{A}\otimes\mathbf{B}=\begin{bmatrix}A_{11}\mathbf{B}&\dots&A_{1M}\mathbf{B}\\ \vdots&\ddots&\vdots\\ A_{L1}\mathbf{B}&\cdots&A_{LM}\mathbf{B}\end{bmatrix}\,.

(13)

In statistics, the Kronecker product features in the definition of a matrix normal distribution and is sometimes helpful when specifying the kernel function of a multivariate Gaussian process (Werner et al., 2008). Here, the product is equivalent to the parallel action of individual quantum gates on individual qubits. Simple application of Formula (13) shows that

\displaystyle\mathbf{H}^{\otimes 2}=\mathbf{H}\otimes\mathbf{H}=\frac{1}{2}\left[\begin{array}[]{rrrr}1&1&1&1\\ 1&-1&1&-1\\ 1&1&-1&-1\\ 1&-1&-1&1\end{array}\right]\quad\mbox{and}\quad\ket{0}\otimes\ket{0}=\begin{bmatrix}1\\ 0\end{bmatrix}\otimes\begin{bmatrix}1\\ 0\end{bmatrix}=\begin{bmatrix}1\\ 0\\ 0\\ 0\end{bmatrix}\,.

(18)

We may also denote the left product $\mathbf{H}^{\otimes 2}$ and the right product $\ket{0}^{\otimes 2}$ , $\ket{00}$ or $\ket{0}\ket{0}$ . One may therefore express (12) as a series of transformations applied to the 4-vector on the very right. Letting $\ket{10}$ , $\ket{01}$ and $\ket{11}$ take on analogous meanings to $\ket{00}$ , an immediate result of (18) is that

\displaystyle\mathbf{H}^{\otimes 2}\ket{00}=\frac{1}{2}\Big{(}\ket{00}+\ket{01}+\ket{10}+\ket{11}\Big{)}\,.

(19)

The action of $\mathbf{H}^{\otimes 2}$ transforms $\ket{00}$ into a superposition of the states $\ket{00}$ , $\ket{10}$ , $\ket{01}$ and $\ket{11}$ , where the probability of selecting each is a uniform $(1/2)^{2}=1/4$ . Writing so many $0$ s and $1$ s is tedious, so an alternative notation becomes preferable. Exchanging $\ket{0}$ for $\ket{00}$ , $\ket{1}$ for $\ket{01}$ , $\ket{2}$ for $\ket{10}$ , and $\ket{3}$ for $\ket{11}$ , (19) becomes the more succinct

\displaystyle\mathbf{H}^{\otimes 2}\ket{00}=\frac{1}{2}\Big{(}\ket{0}+\ket{1}+\ket{2}+\ket{3}\Big{)}\,.

This formula extends generally to operations over $n$ qubits. Now letting $N=2^{n}$ , we have

\displaystyle\mathbf{H}^{\otimes n}\ket{0}^{\otimes n}=\frac{1}{\sqrt{N}}\Big{(}\ket{0}+\ket{1}+\dots+\ket{N-1}\Big{)}=\frac{1}{\sqrt{N}}\sum_{x=0}^{N-1}\ket{x}=:\ket{h}\,,

and we call $\ket{h}$ a uniform superposition over the states $\ket{0},\dots,\ket{N-1}$ . The many-qubit analogue for the quantum parallelism of (12) is then

\displaystyle\ket{0}^{\otimes n}\ket{0}\longrightarrow\left(\frac{1}{\sqrt{N}}\sum_{x=0}^{N-1}\ket{x}\right)\ket{0}\longrightarrow\frac{1}{\sqrt{N}}\sum_{x=0}^{N-1}\ket{x}\ket{f(x)}\,.

Appendix B Gumbel-max

We wish to randomly select a single element from the set $\{0,1,\dots,P\}$ with probability proportional to the unnormalized probabilities $\bm{\pi}^{*}=(\pi_{0}^{*},\pi_{1}^{*},\dots,\pi_{P}^{*})$ . That is, there exists a $c>0$ such that $\bm{\pi}^{*}=c\bm{\pi}$ , for $\bm{\pi}$ a valid probability vector, but we only have access to $\bm{\pi}^{*}$ . Define $\bm{\lambda}:=\log\bm{\pi}$ and $\bm{\lambda}^{*}:=\log\bm{\pi}^{*}=\log\bm{\pi}+\log c$ , and assume that $z_{0},\dots,z_{P}\stackrel{{\scriptstyle iid}}{{\sim}}Gumbel(0,1)$ . Then, the probability density function $g(\cdot)$ and cumulative distribution function $G(\cdot)$ for each individual $z_{p}$ are

\displaystyle g(z_{p})=\exp\big{(}-z_{p}-\exp(-z_{p})\big{)}\quad\mbox{and}\quad G(z_{p})=\exp\big{(}-\exp(-z_{p})\big{)}\,.

Now, defining the random variables $\alpha^{*}_{p}:=\lambda^{*}_{p}+z_{p}$ , $\alpha_{p}:=\lambda_{p}+z_{p}$ and

\displaystyle\hat{p}=\operatornamewithlimits{arg\,max}_{p=0,\dots,P}\>\alpha^{*}_{p}\,,

we have the result

\displaystyle\mbox{Pr}(\hat{p}=p)=\pi_{p}\,.

In words, the procedure of adding Gumbel noise to unnormalized log-probabilities and taking the index of the maximum produces a random variable that follows the discrete distribution over $\bm{\pi}$ . Moving from left to right:

	$\displaystyle\mbox{Pr}(\hat{p}=p)$	$\displaystyle=\mbox{Pr}(\alpha^{}_{p}>\alpha^{}_{p^{\prime}},\,\forall p^{\prime}\neq p)$
		$\displaystyle=\mbox{Pr}(\alpha_{p}+\log c>\alpha_{p^{\prime}}+\log c,\,\forall p^{\prime}\neq p)=\mbox{Pr}(\alpha_{p}>\alpha_{p^{\prime}},\,\forall p^{\prime}\neq p)$
		$\displaystyle=\int_{-\infty}^{\infty}\prod_{p^{\prime}\neq p}\mbox{Pr}(\alpha_{p}>\alpha_{p^{\prime}}\|\alpha_{p})g(\alpha_{p}-\lambda_{p})\,\mbox{d}\alpha_{p}=\int_{-\infty}^{\infty}\prod_{p^{\prime}\neq p}G(\alpha_{p}-\lambda_{p^{\prime}})g(\alpha_{p}-\lambda_{p})\,\mbox{d}\alpha_{p}$
		$\displaystyle=\int_{-\infty}^{\infty}\prod_{p^{\prime}\neq p}\exp\big{(}-\exp(\lambda_{p^{\prime}}-\alpha_{p})\big{)}\exp\big{(}-\alpha_{p}+\lambda_{p}-\exp(-\alpha_{p}+\lambda_{p})\big{)}\,\mbox{d}\alpha_{p}\,.$

Recalling that $\lambda_{p^{\prime}}=\log\pi_{p^{\prime}}$ , we exponentiate the logarithms, and the integral becomes

	$\displaystyle\pi_{p}\int_{-\infty}^{\infty}\prod_{p^{\prime}\neq p}\exp\big{(}-\pi_{p^{\prime}}\exp(-\alpha_{p})\big{)}\exp\big{(}-\alpha_{p}-\pi_{p}\exp(-\alpha_{p})\big{)}\,\mbox{d}\alpha_{p}$
	$\displaystyle=\pi_{p}\int_{-\infty}^{\infty}\exp(-\alpha_{p})\exp\big{(}-\sum_{p^{\prime}=0}^{P}\pi_{p^{\prime}}\exp(-\alpha_{p})\big{)}\,\mbox{d}\alpha_{p}$
	$\displaystyle=\pi_{p}\int_{-\infty}^{\infty}\exp(-\alpha_{p})\exp\big{(}-\exp(-\alpha_{p})\big{)}\,\mbox{d}\alpha_{p}=\pi_{p}\,,$

where the final equality follows easily from a change of variables.

Appendix C Mixing of parallel MCMC and QPMCMC

To ascertain whether QPMCMC mixes differently compared to conventional parallel MCMC, we run both algorithms for a range of proposal counts to sample from a 10-dimensional standard normal distribution. For each algorithm and proposal setting, we run 100 independent chains for 10,000 iterations and obtain effective sample sizes ESS_d for $d\in\{1,\dots,10\}$ . We then calculate the relative differences between the means and minima of one chain generated using parallel MCMC and one chain generated using QPMCMC; for example:

\displaystyle\mbox{Relative difference between means}\>\overline{\mbox{ESS}}_{(\cdot)}=\frac{\left|\overline{\mbox{ESS}}_{pMCMC}-\overline{\mbox{ESS}}_{QPMCMC}\right|}{\overline{\mbox{ESS}}_{pMCMC}}

Figure 9 shows results. In general, the majority relative differences are small. For both statistics, mean relative differences are less than 0.05, regardless of proposal count. Again for both statistics, more than 75% of the independent runs result in relative differences below 0.1. We note that relative differences between means (blue) appear to decrease with the number of proposals, but the same does not hold for relative differences between minima (red).

Appendix D Note on simulations

We use R (R Core Team, 2021), Python (van Rossum, 1995), TensorFlow (Abadi et al., 2016) and TensorFlow Probability MCMC (Lao et al., 2020) in our simulations and the ggplot2 package to generate all figures (Wickham, 2016). In R, we use the package coda (Plummer et al., 2006) to calculate effective sample sizes. In Python, we use a built-in function from TensorFlow Probability MCMC. The primary color palette comes from Ram and Wickham (2018).

All simulations rely on the fact that the Grover iterations of (3) manipulate the probability amplitudes in a deterministic manner. For example, the following R code specifies $N$ uniform probability amplitudes (sqrtProbs), performs $I$ Grover iterations and measures a single state according to the resulting probabilities.

sqrtProbs <- rep(1/sqrt(N),N); i <- 1while (i <= I) {sqrtProbs <- (1-2*marks)*sqrtProbssqrtProbs <- -sqrtProbs + 2*mean(sqrtProbs)i <- i + 1}measurement <- sample(size=1, x=1:N, prob=sqrtProbs^2)Of course, simulating these iterations on a classical computer requires precomputing the values of marks beforehand. All code is available online at https://github.com/andrewjholbrook/qpMCMC.

References

Abadi et al. (2016) Abadi, M., P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. 2016. $\{$ TensorFlow $\}$ : A system for $\{$ Large-Scale $\}$ machine learning. Pages 265–283 in 12th USENIX symposium on operating systems design and implementation (OSDI 16).
Akiyama et al. (2022) Akiyama, K., A. Alberdi, W. Alef, J. C. Algaba, R. Anantua, K. Asada, R. Azulay, U. Bach, A.-K. Baczko, D. Ball, et al. 2022. First sagittarius a* event horizon telescope results. i. the shadow of the supermassive black hole in the center of the milky way. The Astrophysical Journal Letters 930:L12.
Bennett et al. (1997) Bennett, C. H., E. Bernstein, G. Brassard, and U. Vazirani. 1997. Strengths and weaknesses of quantum computing. SIAM journal on Computing 26:1510–1523.
Berzuini and Gilks (2001) Berzuini, C. and W. Gilks. 2001. Resample-move filtering with cross-model jumps. Pages 117–138 in Sequential Monte Carlo Methods in Practice. Springer.
Biamonte et al. (2017) Biamonte, J., P. Wittek, N. Pancotti, P. Rebentrost, N. Wiebe, and S. Lloyd. 2017. Quantum machine learning. Nature 549:195–202.
Boyer et al. (1998) Boyer, M., G. Brassard, P. Høyer, and A. Tapp. 1998. Tight bounds on quantum searching. Fortschritte der Physik: Progress of Physics 46:493–505.
Calderhead (2014) Calderhead, B. 2014. A general construction for parallelizing metropolis- hastings algorithms. Proceedings of the National Academy of Sciences 111:17408–17413.
Cortese and Braje (2018) Cortese, J. A. and T. M. Braje. 2018. Loading classical data into a quantum computer. arXiv preprint arXiv:1803.01958 .
Csilléry et al. (2010) Csilléry, K., M. G. Blum, O. E. Gaggiotti, and O. François. 2010. Approximate bayesian computation (abc) in practice. Trends in ecology & evolution 25:410–418.
Delmas and Jourdain (2009) Delmas, J.-F. and B. Jourdain. 2009. Does waste recycling really improve the multi-proposal metropolis–hastings algorithm? an analysis based on control variates. Journal of applied probability 46:938–959.
Doucet et al. (2001) Doucet, A., N. De Freitas, N. J. Gordon, et al. 2001. Sequential Monte Carlo methods in practice vol. 1. Springer.
Durr and Hoyer (1996) Durr, C. and P. Hoyer. 1996. A quantum algorithm for finding the minimum. arXiv preprint quant-ph/9607014 .
Erhard et al. (2019) Erhard, A., J. J. Wallman, L. Postler, M. Meth, R. Stricker, E. A. Martinez, P. Schindler, T. Monz, J. Emerson, and R. Blatt. 2019. Characterizing large-scale quantum computers via cycle benchmarking. Nature communications 10:1–7.
Frenkel (2004) Frenkel, D. 2004. Speed-up of monte carlo simulations by sampling of rejected states. Proceedings of the National Academy of Sciences 101:17571–17575.
Gelman and Rubin (1992a) Gelman, A. and D. B. Rubin. 1992a. Inference from iterative simulation using multiple sequences. Statistical science 7:457–472.
Gelman and Rubin (1992b) Gelman, A. and D. B. Rubin. 1992b. A single series from the gibbs sampler provides a false sense of security. Bayesian statistics 4:625–631.
Giovannetti et al. (2008) Giovannetti, V., S. Lloyd, and L. Maccone. 2008. Quantum random access memory. Physical review letters 100:160501.
Glatt-Holtz et al. (2022) Glatt-Holtz, N. E., A. J. Holbrook, J. A. Krometis, and C. F. Mondaini. 2022. Parallel mcmc algorithms: Theoretical foundations, algorithm design, case studies.
Grover (1996) Grover, L. K. 1996. A fast quantum mechanical algorithm for database search. Pages 212–219 in Proceedings of the twenty-eighth annual ACM symposium on Theory of computing.
Haener et al. (2018) Haener, T., M. Soeken, M. Roetteler, and K. M. Svore. 2018. Quantum circuits for floating-point arithmetic. Pages 162–174 in International Conference on Reversible Computation Springer.
Harrow et al. (2009) Harrow, A. W., A. Hassidim, and S. Lloyd. 2009. Quantum algorithm for linear systems of equations. Physical review letters 103:150502.
Hastings (1970) Hastings, W. K. 1970. Monte carlo sampling methods using markov chains and their applications .
Holbrook (2021) Holbrook, A. J. 2021. Generating mcmc proposals by randomly rotating the regular simplex. arXiv preprint arXiv:2110.06445 .
Holbrook et al. (2022a) Holbrook, A. J., X. Ji, and M. A. Suchard. 2022a. Bayesian mitigation of spatial coarsening for a hawkes model applied to gunfire, wildfire and viral contagion. The Annals of Applied Statistics 16:573–595.
Holbrook et al. (2022b) Holbrook, A. J., X. Ji, and M. A. Suchard. 2022b. From viral evolution to spatial contagion: a biologically modulated hawkes model. Bioinformatics 38:1846–1856.
Holbrook et al. (2021a) Holbrook, A. J., P. Lemey, G. Baele, S. Dellicour, D. Brockmann, A. Rambaut, and M. A. Suchard. 2021a. Massive parallelization boosts big bayesian multidimensional scaling. Journal of Computational and Graphical Statistics 30:11–24.
Holbrook et al. (2021b) Holbrook, A. J., C. E. Loeffler, S. R. Flaxman, and M. A. Suchard. 2021b. Scalable bayesian inference for self-excitatory stochastic processes applied to big american gunfire data. Statistics and Computing 31:1–15.
Hurn (1997) Hurn, M. 1997. Difficulties in the use of auxiliary variables in markov chain monte carlo methods. Statistics and Computing 7:35–44.
Jordan (2005) Jordan, S. P. 2005. Fast quantum algorithm for numerical gradient estimation. Physical review letters 95:050501.
Kielpinski et al. (2002) Kielpinski, D., C. Monroe, and D. J. Wineland. 2002. Architecture for a large-scale ion-trap quantum computer. Nature 417:709–711.
Lao et al. (2020) Lao, J., C. Suter, I. Langmore, C. Chimisov, A. Saxena, P. Sountsov, D. Moore, R. A. Saurous, M. D. Hoffman, and J. V. Dillon. 2020. tfp. mcmc: Modern markov chain monte carlo tools built for modern hardware. arXiv preprint arXiv:2002.01184 .
Lemieux et al. (2020) Lemieux, J., B. Heim, D. Poulin, K. Svore, and M. Troyer. 2020. Efficient quantum walk circuits for metropolis-hastings algorithm. Quantum 4:287.
Lopatnikova and Tran (2021) Lopatnikova, A. and M.-N. Tran. 2021. An introduction to quantum computing for statisticians. arXiv preprint arXiv:2112.06587 .
Luo and Tjelmeland (2019) Luo, X. and H. Tjelmeland. 2019. A multiple-try metropolis–hastings algorithm with tailored proposals. Computational Statistics 34:1109–1133.
Metropolis et al. (1953) Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. 1953. Equation of state calculations by fast computing machines. The journal of chemical physics 21:1087–1092.
Montanaro (2015) Montanaro, A. 2015. Quantum speedup of monte carlo methods. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 471:20150301.
Neal (2011) Neal, R. 2011. Mcmc using hamiltonian dynamics. Handbook of markov chain monte carlo 2:2.
Neal (2003) Neal, R. M. 2003. Markov chain sampling for non-linear state space models using embedded hidden markov models. arXiv preprint math/0305039 .
Nielsen and Chuang (2002) Nielsen, M. A. and I. Chuang. 2002. Quantum computation and quantum information.
Papandreou and Yuille (2011) Papandreou, G. and A. L. Yuille. 2011. Perturb-and-map random fields: Using discrete optimization to learn and sample from energy models. Pages 193–200 in 2011 International Conference on Computer Vision IEEE.
Plummer et al. (2006) Plummer, M., N. Best, K. Cowles, and K. Vines. 2006. Coda: Convergence diagnosis and output analysis for mcmc. R News 6:7–11.
R Core Team (2021) R Core Team. 2021. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing Vienna, Austria.
Ram and Wickham (2018) Ram, K. and H. Wickham. 2018. wesanderson: A Wes Anderson Palette Generator. R package version 0.3.6.
Roberts and Rosenthal (2001) Roberts, G. O. and J. S. Rosenthal. 2001. Optimal scaling for various metropolis-hastings algorithms. Statistical science 16:351–367.
Rosenthal (2011) Rosenthal, J. S. 2011. Optimal proposal distributions and adaptive mcmc. Handbook of Markov Chain Monte Carlo 4.
Salmon et al. (2011) Salmon, J. K., M. A. Moraes, R. O. Dror, and D. E. Shaw. 2011. Parallel random numbers: as easy as 1, 2, 3. Pages 1–12 in Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis.
Schwedes and Calderhead (2021) Schwedes, T. and B. Calderhead. 2021. Rao-blackwellised parallel mcmc. Pages 3448–3456 in International Conference on Artificial Intelligence and Statistics PMLR.
Shor (1994) Shor, P. 1994. Algorithms for quantum computation: discrete logarithms and factoring. Pages 124–134 in Proceedings 35th Annual Symposium on Foundations of Computer Science.
Szegedy (2004) Szegedy, M. 2004. Quantum speed-up of markov chain based algorithms. Pages 32–41 in 45th Annual IEEE symposium on foundations of computer science IEEE.
Tjelmeland (2004) Tjelmeland, H. 2004. Using all metropolis–hastings proposals to estimate mean values. Tech. rep. Citeseer.
van Rossum (1995) van Rossum, G. 1995. Python reference manual. Department of Computer Science [CS] .
Wang and Liu (2022) Wang, Y. and H. Liu. 2022. Quantum computing in a statistical context. Annual Review of Statistics and Its Application 9.
Werner et al. (2008) Werner, K., M. Jansson, and P. Stoica. 2008. On estimation of covariance matrices with kronecker product structure. IEEE Transactions on Signal Processing 56:478–491.
Wickham (2016) Wickham, H. 2016. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
Wocjan and Abeyesinghe (2008) Wocjan, P. and A. Abeyesinghe. 2008. Speedup via quantum sampling. Physical Review A 78:042336.
Yang et al. (2018) Yang, S., Y. Chen, E. Bernton, and J. S. Liu. 2018. On parallelizable markov chain monte carlo algorithms with waste-recycling. Statistics and Computing 28:1073–1081.
Yoder et al. (2014) Yoder, T. J., G. H. Low, and I. L. Chuang. 2014. Fixed-point quantum search with an optimal number of queries. Physical review letters 113:210501.