Avoiding Barren Plateaus with Entanglement

Yuhan Yao and Yoshihiko Hasegawa
Department of Information and Communication Engineering, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 113-8656, Japan
Abstract

In the search for quantum advantage with near-term quantum devices, navigating the optimization landscape is significantly hampered by the barren plateaus phenomenon. This study presents a strategy to overcome this obstacle without changing the quantum circuit architecture. We propose incorporating auxiliary control qubits to shift the circuit from a unitary 2-design to a unitary 1-design, mitigating the prevalence of barren plateaus. We then remove these auxiliary qubits to return to the original circuit structure while preserving the unitary 1-design properties. Our experiment suggests that the proposed structure effectively mitigates the barren plateaus phenomenon. A significant experimental finding is that the gradient of $\theta_{1,1}$, the first parameter in the quantum circuit, displays a broader distribution as the number of qubits and layers increases. This suggests a higher probability of obtaining effective gradients. This stability is critical for the efficient training of quantum circuits, especially for larger and more complex systems. The results of this study represent a significant advance in the optimization of quantum circuits and offer a promising avenue for the scalable and practical implementation of quantum computing technologies. This approach opens up new opportunities in quantum learning and other applications that require robust quantum computing power.

I Introduction

Quantum information science, especially quantum computing, has made significant theoretical and experimental progress in recent years. With the advent of the first generation of quantum computers, we have entered the era of noisy intermediate-scale quantum (NISQ) devices [1, 2, 3]. Despite the potential advantages that these early quantum computers have in tackling complex computational problems, their practical application is hindered by several factors, including the quality of the qubits, error rates in the quantum gates, and algorithmic optimization problems such as the phenomenon of barren plateaus [4, 5, 6, 7, 8, 9, 10]. The concept of “barren plateaus” is important for quantum machine learning and optimization. It refers to regions in the parameter space of quantum neural networks [8] where the gradients are vanishingly small. This poses a major challenge for gradient-based optimization methods, such as gradient descent, because the gradient provides little to no guidance, leading to stagnation in the training process. The phenomenon is characterized by extremely flat landscapes in high-dimensional quantum parameter spaces. It has proven to be a critical bottleneck in optimizing quantum algorithms and hinders the further development of quantum computing.

The implementation of quantum algorithms can be hindered by barren plateaus, which is especially problematic for quantum machine learning algorithms and variational quantum algorithms (VQAs) [11, 12, 5, 13, 14]. As quantum systems increase in size, the vanishing of gradients becomes more pronounced during the optimization process, resulting in inefficient local search strategies. Moreover, this challenge is closely related to the limitations of quantum hardware, such as the limited number of qubits [15, 16, 17] and errors in quantum gate operations [18, 19, 20, 21]. These challenges affect the practicality of quantum algorithms and limit the potential applications of quantum computing in various fields.

This study explores the causes and possible solutions of the barren plateaus phenomenon and its impact on optimizing quantum algorithms. We focus on developing new algorithms and strategies to address the issue of gradient vanishing in high-dimensional quantum parameter spaces. This includes improving optimization strategies, exploring effective initialization protocols, and investigating the relationship between gradient vanishing and the properties of quantum hardware. Our ultimate goal is to enhance the efficiency and success rate of quantum algorithm optimization, thus enabling the application of quantum computing in practical scenarios. Additionally, we provide insights into the understanding and control of complex quantum systems.

In related works, we review the essential literature and theoretical foundations related to the phenomenon of barren plateaus in variational quantum algorithms (VQAs) [5, 22, 23, 24, 25]. Barren plateaus are characterized by gradient vanishing issues in parameterized quantum circuits and have become a focal point in current quantum computing research. First, we refer to studies on the relationship between entanglement and learning efficiency in quantum neural networks [22], which suggest that excessive quantum entanglement can lead to reduced learning efficiency. Additionally, regarding the importance of initialization strategies in overcoming barren plateaus, we examine technical notes on new initialization techniques [23] that prevent early-stage gradient disappearance by controlling initial parameter selection. Next, we consider research defining weak barren plateaus (WBPs) based on the entropies of reduced local density matrices [24], providing a new perspective for understanding and quantifying barren plateaus. Lastly, we synthesize studies on the impact of observable selection on trainability in VQAs [25], highlighting the differences in defining cost functions using global versus local observables and their effects on gradient vanishing issues. These pieces of literature and theoretical insights form the foundation of our study, paving the way for us to propose new solutions and theoretical insights.

This paper presents an approach to address the barren plateaus phenomenon in quantum circuits by strategically entangling auxiliary qubits into the circuit. The gradient is then maintained while removing the auxiliary qubits. This approach transforms the circuit into a local unitary 1-design without altering its original structure or functionality. The initial unitary operation is changed into a form similar to $\alpha I+\beta U$, which shifts the distribution from a unitary $t$-design to a unitary $(t-1)$-design. This modification preserves the expected gradient value while decreasing the dependence of the gradient variance on the number of qubits. The methodology also involves a structured optimization process in which auxiliary qubits are gradually removed during training sessions. This assimilates the pending layers into the fixed layers with trained parameters. It preserves the circuit’s core functions, ensures efficient operation, and prevents regression to a unitary $t$-design.

The significance of this study lies in its potential impact on the field of quantum computing. Our method simplifies the quantum circuit, enhancing the efficiency of parameter training and offering a new perspective on addressing the barren plateaus challenge. This has important implications for developing efficient quantum algorithms and advancing quantum computing towards more practical applications. Future research could investigate the application of our method to various types of quantum circuits and parameter training techniques to optimize and expand its applicability.

II Preliminaries

Our analysis employs random unitary operations along with $t$-designs. To provide a clear foundation, we start with an introduction to these key concepts. Let $U(N)$ be the unitary group of degree $N$, and denote the Haar measure on $U(N)$ by $\mathbf{H}$. A Haar random unitary $\mu$ is a $U(N)$-valued random variable distributed according to the Haar measure, $\mu\sim\mathbf{H}$ [26].

Let $\nu$ be a probability measure on the unitary group $U(N)$. A random unitary $\mu$ drawn from $\nu$ is called an $\epsilon$-approximate unitary $t$-design if it satisfies the condition $\|\mathcal{G}_{\mu\sim\nu}^{(t)}-\mathcal{G}_{\mu\sim\mathbf{H}}^{(t)}\|_{\diamond}\leq\epsilon$, where $\|\cdot\|_{\diamond}$ is the diamond norm, and $\mathcal{G}_{\mu\sim\nu}^{(t)}$ and $\mathcal{G}_{\mu\sim\mathbf{H}}^{(t)}$ are the $t$-fold twirling channels, $\mathcal{G}_{\mu\sim\nu}^{(t)}(A)=\mathbb{E}_{\mu\sim\nu}[\mu^{\otimes t}A(\mu^{\dagger})^{\otimes t}]$, obtained by averaging over $\nu$ and over the Haar measure, respectively [26].

Let $X$ be a finite subset of $U(N)$, the group of $N\times N$ unitary matrices. We consider the expression

$$\frac{1}{|X|}\sum_{U\in X}f^{\otimes t}(U)=\int_{U(N)}d\mu\, f^{\otimes t}(U), \tag{1}$$

where $d\mu$ denotes the normalized Haar measure on $U(N)$, satisfying $\int_{U(N)}d\mu=1$. The cardinality of $X$, denoted by $|X|$, is the number of elements of $X$.
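
As a concrete check of Eq. (1), the sketch below takes $t=1$, $N=2$ and the hypothetical choice $f(U)=UAU^{\dagger}$ for an arbitrary matrix $A$. It compares the average over the single-qubit Pauli set with the known Haar average $\mathrm{Tr}(A)\,I/N$. The helper names and the test matrix are illustrative assumptions, and only the Python standard library is used.

```python
# Toy check of Eq. (1) for t = 1 and N = 2, taking f(U) = U A U^dagger.
# Assumption: X is the single-qubit Pauli set {I, X, Y, Z}. For this f the
# Haar integral has the closed form Tr(A) * I / N.

def matmul(a, b):
    """Product of two 2x2 complex matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def dagger(a):
    """Conjugate transpose of a 2x2 matrix."""
    return [[a[j][i].conjugate() for j in range(2)] for i in range(2)]

PAULIS = [
    [[1, 0], [0, 1]],      # I
    [[0, 1], [1, 0]],      # X
    [[0, -1j], [1j, 0]],   # Y
    [[1, 0], [0, -1]],     # Z
]

def pauli_twirl(a):
    """Left-hand side of Eq. (1): average of U A U^dagger over the Pauli set."""
    total = [[0j, 0j], [0j, 0j]]
    for u in PAULIS:
        term = matmul(matmul(u, a), dagger(u))
        for i in range(2):
            for j in range(2):
                total[i][j] += term[i][j] / len(PAULIS)
    return total

def haar_twirl(a):
    """Right-hand side of Eq. (1) for this f: Tr(A) * I / 2."""
    tr = a[0][0] + a[1][1]
    return [[tr / 2, 0j], [0j, tr / 2]]

A = [[0.3 + 0.1j, 1.2 - 0.7j], [-0.4j, 2.0]]  # arbitrary test matrix
lhs, rhs = pauli_twirl(A), haar_twirl(A)
max_diff = max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))
print(max_diff)  # numerically zero if the Pauli set reproduces the Haar average
```

The two sides agree up to rounding, so in this example the Pauli set behaves as a unitary 1-design. The same set does not satisfy the corresponding identity for $t=2$.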

If the subset $X\subseteq U(N)$ is a $t$-design, its size must satisfy the bound [27]:

$$|X|\geq D\left(N,\left\lceil\frac{t}{2}\right\rceil,\left\lfloor\frac{t}{2}\right\rfloor\right), \tag{2}$$

where $D$ is defined in Appendix A. Equation (2) shows that an arbitrary finite set of unitaries does not form a unitary $t$-design: at a minimum, the set must contain enough elements to satisfy this bound.

Figure 1: The architecture of random parameterized quantum circuits (RPQCs) begins with a Hadamard gate on each of the $n$ qubits, which prepares a uniform superposition. This is followed by $L$ sequential layers, each made of two components. The first component of each layer, $U_{l}(\boldsymbol{\theta}_{l})$, is parameterized and consists of $n$ rotation gates. Each gate is defined by a rotation axis $P_{l,i}\in\{X,Y,Z\}$ and an angle $\theta_{l,i}\in[0,2\pi)$, sampled independently. The second component, $W_{l}$, is non-parameterized and comprises a series of CNOT gates that entangle adjacent qubits.

In random parameterized quantum circuits (RPQCs) comprising $n$ qubits and $L$ layers, as shown in Fig. 1, the unitary operator $U(\boldsymbol{\theta})$ is defined as

$$U(\boldsymbol{\theta})=\prod_{l=1}^{L}U_{l}(\boldsymbol{\theta}_{l})W_{l}=\prod_{l=1}^{L}\left(\prod_{i=1}^{n}e^{-i\theta_{l,i}V_{i}}\right)W_{l}, \tag{3}$$

where $l\in[1,L]$ indexes the layers and $i\in[1,n]$ indexes the qubits. $U_{l}(\boldsymbol{\theta}_{l})$ and $W_{l}$ are unitary operators. Here, $V_{i}$ is a Pauli operator, and $W_{l}$ is a fixed unitary operator that does not depend on the angles $\theta_{l,i}$.

Consider a quantum circuit whose initial state is prepared in $|\mathbf{init}\rangle$. The objective function $E(\boldsymbol{\theta})$ is defined as the expectation value of a Hermitian operator $H$, which is provided externally. This expectation value is obtained by applying a unitary operation $U(\boldsymbol{\theta})$ to the initial state and then performing a measurement corresponding to the operator $H$. The unitary operation is parameterized by a set of parameters $\boldsymbol{\theta}$. The objective function is given by:

$$E(\boldsymbol{\theta})=\langle\mathbf{init}|U(\boldsymbol{\theta})^{\dagger}HU(\boldsymbol{\theta})|\mathbf{init}\rangle, \tag{4}$$

where $U(\boldsymbol{\theta})^{\dagger}$ is the Hermitian conjugate of $U(\boldsymbol{\theta})$. To compute the gradient of the objective function, let $\partial_{k}E\equiv\partial E(\boldsymbol{\theta})/\partial\theta_{k}$ denote the partial derivative of $E(\boldsymbol{\theta})$ with respect to the parameter $\theta_{k}$. The average value of this gradient is

$$\langle\partial_{k}E\rangle=\frac{1}{|\theta_{k}|}\sum_{\theta_{k}}\partial_{k}E=\int d\mu\,\partial_{k}E=0. \tag{5}$$

Additionally, the variance of the gradient is given by:

$$\begin{aligned}
\mathrm{Var}[\partial_{k}E] &= \langle(\partial_{k}E)^{2}\rangle-\langle\partial_{k}E\rangle^{2} \\
&= \frac{1}{|\theta_{k}|}\sum_{\theta_{k}}(\partial_{k}E)^{2}=\int d\mu\,(\partial_{k}E)^{2} \\
&\approx \frac{\mathrm{Tr}(H^{2})\,\mathrm{Tr}(\rho^{2})\,\mathrm{Tr}(V_{k}^{2})}{N^{3}-N},
\end{aligned} \tag{6}$$

where $\rho=|\mathbf{init}\rangle\langle\mathbf{init}|$.
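
As a toy illustration of these definitions (not of the 2-design estimate in Eq. (6)), consider a hypothetical single-qubit circuit with $|\mathbf{init}\rangle=|0\rangle$, $U(\theta)=e^{-i\theta Y}$ and $H=Z$, with $\theta$ drawn uniformly from $[0,2\pi)$. These choices are assumptions made only for this example:

$$\begin{aligned}
U(\theta)|0\rangle &= \cos\theta\,|0\rangle+\sin\theta\,|1\rangle,\\
E(\theta) &= \cos^{2}\theta-\sin^{2}\theta=\cos 2\theta,\qquad \partial_{\theta}E=-2\sin 2\theta,\\
\langle\partial_{\theta}E\rangle &= \frac{1}{2\pi}\int_{0}^{2\pi}(-2\sin 2\theta)\,d\theta=0,\\
\mathrm{Var}[\partial_{\theta}E] &= \frac{1}{2\pi}\int_{0}^{2\pi}4\sin^{2}2\theta\,d\theta=2.
\end{aligned}$$

The mean gradient vanishes while the variance remains of order one. The suppression described by Eq. (6) sets in only as $N$ grows.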

As the number of qubits $n$ increases, the dimension of the system, $N=2^{n}$, grows exponentially. The variance of the gradient therefore decreases as $N$ increases and ultimately vanishes. This decrease in variance restricts the expressive potential of random parameterized quantum circuits [4].
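
To make the scaling concrete, the following sketch evaluates the right-hand side of Eq. (6) under the additional assumptions that $H$ and $V_{k}$ are Pauli strings (so $\mathrm{Tr}(H^{2})=\mathrm{Tr}(V_{k}^{2})=N$) and that $\rho$ is pure (so $\mathrm{Tr}(\rho^{2})=1$). The function name is hypothetical.

```python
from fractions import Fraction

def two_design_variance(n_qubits):
    """Right-hand side of Eq. (6), assuming Pauli-string H and V_k and a pure rho.

    With Tr(H^2) = Tr(V_k^2) = N and Tr(rho^2) = 1, the expression reduces to
    N^2 / (N^3 - N) = N / (N^2 - 1), where N = 2^n.
    """
    N = 2 ** n_qubits
    return Fraction(N * 1 * N, N ** 3 - N)

for n in (2, 4, 8, 16):
    print(n, float(two_design_variance(n)))
```

Under these assumptions the estimate behaves like $2^{-n}$ for large $n$, which is the exponential decay referred to above.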

III Methods

As discussed earlier, VQAs are plagued by vanishing gradients, commonly known as the barren plateaus problem. Barren plateaus arise from the circuit’s unitary 2-design characteristics. Previous studies [24, 25, 23] have demonstrated that adopting a local unitary 1-design can mitigate this issue. This paper proposes a method to alleviate the barren plateaus problem through entanglement: auxiliary qubits are added to the quantum circuit to transform it into a local unitary 1-design. The auxiliary qubits leave the original structure and function of the existing circuit unaltered. Their role is primarily to mix in additional information, which offers another way to address this obstacle without structural modifications to the circuit.

III.1 Entanglement with auxiliary qubits

In our study, the goal is to overcome the problem of barren plateaus in quantum circuits by using auxiliary qubits without altering the original circuit design. We employ a specific configuration, as illustrated in Fig. 2, to transform the information on the initial qubit. This transformation changes the form from being similar to $U$ to a new form that resembles $\alpha I+\beta U$, where $\alpha$ and $\beta$ are coefficients and $I$ is the identity matrix. This approach mitigates the barren plateaus phenomenon by leveraging the additional qubits. The transformation introduces no new variables and changes the distribution of the entire circuit from a unitary $t$-design to a unitary $(t-1)$-design. While maintaining the expected value of the gradient, it reduces the dependence of the variance of the gradient on the number of qubits, thereby enhancing the feasibility of circuit optimization.

Figure 2: The circuit is augmented with an additional register of $\lceil\log_{2}n\rceil$ qubits. This register begins with Hadamard gates, which prepare a superposition state. These qubits then control the rotation gates within the RPQCs, following a binary encoding sequence. Finally, the auxiliary qubits are measured, which integrates the linear combination $\alpha I+\beta U$ into the RPQCs.
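
One way to see how such a linear combination can arise is the single-auxiliary-qubit case sketched below. It assumes that the auxiliary qubit starts in $|0\rangle$, that a second Hadamard gate is applied to it before measurement, and that the outcome $0$ is kept. These details are illustrative assumptions and are not taken from Fig. 2. Writing $c\text{-}U$ for the controlled version of $U$ and $\mathsf{Had}$ for the Hadamard gate,

$$(\mathsf{Had}\otimes I)\,c\text{-}U\,(\mathsf{Had}\otimes I)\,|0\rangle|\psi\rangle=\frac{1}{2}|0\rangle(I+U)|\psi\rangle+\frac{1}{2}|1\rangle(I-U)|\psi\rangle.$$

Keeping the outcome $0$ leaves the main register in a state proportional to $(I+U)|\psi\rangle$, which has the form $\alpha I+\beta U$ with $\alpha=\beta$ up to normalization.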

Then, let $E^{\prime}(\boldsymbol{\theta})$ be the expectation value of a Hermitian operator $H$ with respect to $\alpha I+\beta U$. $E^{\prime}$ and $\partial_{k}E^{\prime}$ are given by

$$E^{\prime}(\boldsymbol{\theta})=\langle\mathbf{init}|(\alpha I+\beta U(\boldsymbol{\theta}))^{\dagger}H(\alpha I+\beta U(\boldsymbol{\theta}))|\mathbf{init}\rangle, \tag{7}$$

$$\partial_{k}E^{\prime}=i\alpha\beta\langle\mathbf{init}|[R,H]|\mathbf{init}\rangle+\beta^{2}(\partial_{k}E), \tag{8}$$

where $R=U_{+}V_{k}U_{-}$. When calculating the expectation of $\partial_{k}E^{\prime}$ with respect to $U$, the first and second moments are

$$\langle\partial_{k}E^{\prime}\rangle=0, \tag{9}$$

$$\langle(\partial_{k}E^{\prime})^{2}\rangle=\frac{2\alpha^{2}\beta^{2}}{N}\mathrm{Tr}(H^{2}\rho)\,\mathrm{Tr}(\rho)+\beta^{4}\langle(\partial_{k}E)^{2}\rangle, \tag{10}$$

where the first term of $\langle(\partial_{k}E^{\prime})^{2}\rangle$ follows from the property of a unitary 1-design. Examining this unitary 1-design part of $\langle(\partial_{k}E^{\prime})^{2}\rangle$, we find that

$$\frac{2\alpha^{2}\beta^{2}}{N}\mathrm{Tr}(H^{2}\rho)\,\mathrm{Tr}(\rho)\leq\frac{1}{2N}\mathrm{Tr}(H^{2}\rho)\,\mathrm{Tr}(\rho). \tag{11}$$

The maximum value is attained when $\alpha=\beta=\frac{1}{\sqrt{2}}$.
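
The bound in Eq. (11) can be checked in one line if one assumes real coefficients with the normalization $\alpha^{2}+\beta^{2}=1$ (an assumption of this sketch; the normalization is not stated explicitly above). By the inequality of arithmetic and geometric means,

$$\alpha^{2}\beta^{2}\leq\left(\frac{\alpha^{2}+\beta^{2}}{2}\right)^{2}=\frac{1}{4}\quad\Longrightarrow\quad\frac{2\alpha^{2}\beta^{2}}{N}\leq\frac{1}{2N},$$

with equality exactly when $\alpha^{2}=\beta^{2}=1/2$.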

III.2 Eliminate auxiliary qubits

In certain scenarios, using the original circuit is unavoidable. However, when the parameters are optimized with the auxiliary qubits in place, the optimized parameters are generally not optimal for the original circuit. This discrepancy arises because the two circuits optimize different objective functions.

The original circuit aims to minimize the energy $E(\boldsymbol{\theta})=\langle\mathbf{init}|U(\boldsymbol{\theta})^{\dagger}HU(\boldsymbol{\theta})|\mathbf{init}\rangle$. With the introduction of auxiliary qubits, however, the energy changes to $E^{\prime}(\boldsymbol{\theta})=\langle\mathbf{init}|(\alpha I+\beta U(\boldsymbol{\theta}))^{\dagger}H(\alpha I+\beta U(\boldsymbol{\theta}))|\mathbf{init}\rangle$. The parameter sets that minimize $E(\boldsymbol{\theta})$ and $E^{\prime}(\boldsymbol{\theta})$ are therefore generally different. To resolve this, we employ the structure depicted in Fig. 3, which gradually eliminates the auxiliary qubits through multiple training sessions until the circuit returns to its original configuration.

By successively transforming the adjustable layers into fixed layers through the intermediate layers, while keeping the parameters within the fixed layers unchanged, we can gradually restore the circuit to its original structure and eventually eliminate all auxiliary qubits. Adjustable and intermediate layers must be updated simultaneously during this training process. In this process, the adjustable layers are unitary 1-designs, and the intermediate layers are unitary 2-designs. To obtain larger gradient values, the number of layers in the intermediate block (the pending layers) must be kept small. This approach overcomes the challenge posed by the barren plateaus phenomenon and ensures that parameter optimization continues smoothly.
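
The bookkeeping of this procedure can be sketched as follows. The sketch only tracks how many layers belong to each section at each training stage, and it assumes that the intermediate block holds at most `pending_layers` layers and that the whole block is frozen at the end of a stage. The function and parameter names are hypothetical, and the training step itself is omitted.

```python
def layer_schedule(total_layers, pending_layers):
    """Yield (adjustable, intermediate, fixed) layer counts for each training stage.

    Assumptions of this sketch: the intermediate block holds at most
    `pending_layers` layers, and the entire block becomes fixed once its
    stage has been trained.
    """
    fixed = 0
    while fixed < total_layers:
        intermediate = min(pending_layers, total_layers - fixed)
        adjustable = total_layers - fixed - intermediate
        yield adjustable, intermediate, fixed
        fixed += intermediate  # the trained intermediate block becomes fixed

stages = list(layer_schedule(total_layers=6, pending_layers=2))
for adjustable, intermediate, fixed in stages:
    print(f"adjustable={adjustable} intermediate={intermediate} fixed={fixed}")
```

In the last stage no adjustable layers remain, so no auxiliary qubits are needed and the circuit has its original structure. With `total_layers=100` and `pending_layers=1` the schedule has 100 stages, which at 10 epochs per stage would match the 1000 epochs quoted in Sec. IV.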

The parameterized layers of the unitary operator in Fig. 3 conform to a unitary 1-design. To substantiate this claim, let $\boldsymbol{\varphi}$ denote the set of fixed parameters, which are exempt from training, and let $E^{\prime\prime}$ be the energy $E^{\prime}$ with the specific parameter values $\alpha=\beta=1/\sqrt{2}$. We then obtain the following results for $E^{\prime\prime}$ and its gradient $\partial_{k}E^{\prime\prime}$:

$$E^{\prime\prime}(\boldsymbol{\theta})=\mathrm{Tr}\left(\frac{\rho}{2}(I+U(\boldsymbol{\theta}))^{\dagger}U^{\prime}(\boldsymbol{\varphi})^{\dagger}HU^{\prime}(\boldsymbol{\varphi})(I+U(\boldsymbol{\theta}))\right), \tag{12}$$

$$\partial_{k}E^{\prime\prime}(\boldsymbol{\theta})=\frac{i}{2}\mathrm{Tr}\left(\rho[R,H_{\varphi}]+H_{\varphi,+}[\rho_{-},V_{k}]\right), \tag{13}$$

where $H_{\varphi}=U^{\prime}(\boldsymbol{\varphi})^{\dagger}HU^{\prime}(\boldsymbol{\varphi})$, $\rho_{-}=U_{-}^{\dagger}\rho U_{-}$ and $H_{\varphi,+}=U_{+}^{\dagger}H_{\varphi}U_{+}$. Then, if $U_{+}$ is at least a unitary 1-design, the expectation and variance of $\partial_{k}E^{\prime\prime}$ are

$$\langle\partial_{k}E^{\prime\prime}\rangle=0, \tag{14}$$

$$\mathrm{Var}[\partial_{k}E^{\prime\prime}]=\frac{1}{2N}\mathrm{Tr}(\rho H_{\varphi}^{2})\,\mathrm{Tr}(\rho)+O(N^{-1}). \tag{15}$$

As the number of qubits increases, the variance of $\partial_{k}E^{\prime\prime}$ becomes larger than the variance of $\partial_{k}E$. As a result, the distribution of $\partial_{k}E$ is more concentrated around its mean value of zero than the distribution of $\partial_{k}E^{\prime\prime}$. Thus, the suggested approach is expected to yield an effective gradient more reliably than the unitary 2-design structure, which helps overcome the barren plateaus issue.

Figure 3: The structure consists of three sections: adjustable layers, intermediate layers, and fixed layers. Within this structure, the parameters in the adjustable and intermediate layers are updated simultaneously. During the update process, once the intermediate layer parameters are determined, this section becomes fixed layers with unchanging parameters. At this point, the last part of the adjustable layers becomes the new intermediate layers. This entire process is repeated until the length of the fixed layers equals $L$, at which point the optimization is complete.

IV Experiment

To demonstrate the effectiveness of the proposed method, we perform a numerical experiment that compares the performance of three quantum circuit models using PennyLane [28]. The evaluated models are standard RPQCs, which serve as the unitary 2-design baseline; a unitary 1-design structure incorporating auxiliary qubits; and the proposed structure, which is optimized while eliminating these auxiliary qubits. The experimental design involves sampling 100 quantum circuits that are randomly generated according to each model’s configuration. For each circuit, we compute the variance of its gradient for the target operator $H=Z_{1}\otimes Z_{2}$, the tensor product of Pauli $Z$ operators acting on the first and second qubits [25]. This comparison aims to reveal the impact of different design choices on the performance of quantum circuits, particularly in terms of gradient variance.
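
A minimal sketch of the baseline measurement is given below for the RPQC layout of Fig. 1, using a small hand-written statevector simulator in plain Python rather than PennyLane. The gradient with respect to $\theta_{1,1}$ is obtained with the parameter-shift rule, which is exact for gates of the form $e^{-i\theta P}$. The text does not state how gradients were computed, so this is a choice of the sketch, as are all function names, the small circuit sizes, and the random seed.

```python
import math
import random

PAULI = {
    "X": ((0, 1), (1, 0)),
    "Y": ((0, -1j), (1j, 0)),
    "Z": ((1, 0), (0, -1)),
}
HAD = [[1 / math.sqrt(2), 1 / math.sqrt(2)], [1 / math.sqrt(2), -1 / math.sqrt(2)]]

def rotation(axis, theta):
    """2x2 matrix of exp(-i * theta * P) = cos(theta) I - i sin(theta) P."""
    p = PAULI[axis]
    c, s = math.cos(theta), math.sin(theta)
    return [[c * (i == j) - 1j * s * p[i][j] for j in range(2)] for i in range(2)]

def apply_1q(state, gate, q, n):
    """Apply a single-qubit gate to qubit q (qubit 0 is the most significant bit)."""
    bit = 1 << (n - 1 - q)
    out = list(state)
    for idx in range(len(state)):
        if not idx & bit:
            a, b = state[idx], state[idx | bit]
            out[idx] = gate[0][0] * a + gate[0][1] * b
            out[idx | bit] = gate[1][0] * a + gate[1][1] * b
    return out

def apply_cnot(state, control, target, n):
    """Flip the target bit of every basis state whose control bit is set."""
    cbit, tbit = 1 << (n - 1 - control), 1 << (n - 1 - target)
    return [state[idx ^ tbit] if idx & cbit else state[idx]
            for idx in range(len(state))]

def expectation(axes, thetas, n):
    """E(theta) of Eq. (4) with H = Z_1 Z_2; requires n >= 2 qubits."""
    state = [0j] * (2 ** n)
    state[0] = 1.0
    for q in range(n):                      # initial Hadamard layer
        state = apply_1q(state, HAD, q, n)
    for layer_axes, layer_thetas in zip(axes, thetas):
        for q in range(n):                  # parameterized rotations U_l
            state = apply_1q(state, rotation(layer_axes[q], layer_thetas[q]), q, n)
        for q in range(n - 1):              # entangling CNOT chain W_l
            state = apply_cnot(state, q, q + 1, n)
    total = 0.0
    for idx, amp in enumerate(state):
        z1 = 1 - 2 * ((idx >> (n - 1)) & 1)
        z2 = 1 - 2 * ((idx >> (n - 2)) & 1)
        total += z1 * z2 * abs(amp) ** 2
    return total

def grad_first_param(axes, thetas, n):
    """dE/d(theta_{1,1}) via the parameter-shift rule for exp(-i theta P)."""
    plus = [list(row) for row in thetas]
    minus = [list(row) for row in thetas]
    plus[0][0] += math.pi / 4
    minus[0][0] -= math.pi / 4
    return expectation(axes, plus, n) - expectation(axes, minus, n)

def gradient_variance(n, layers, samples, rng):
    """Sample variance of the gradient over randomly drawn circuits."""
    grads = []
    for _ in range(samples):
        axes = [[rng.choice("XYZ") for _ in range(n)] for _ in range(layers)]
        thetas = [[rng.uniform(0, 2 * math.pi) for _ in range(n)]
                  for _ in range(layers)]
        grads.append(grad_first_param(axes, thetas, n))
    mean = sum(grads) / len(grads)
    return sum((g - mean) ** 2 for g in grads) / len(grads)

rng = random.Random(0)
for n in (2, 3, 4):
    print(n, gradient_variance(n, layers=10, samples=100, rng=rng))
```

The sketch covers only the unitary 2-design baseline at small sizes. The auxiliary-qubit and proposed structures would require the controlled rotations of Fig. 2 and the staged training of Fig. 3.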

First, we examine the impact of the number of qubits on the variance of $\partial_{k}E$. We configure all structures to have 100 layers, with the number of qubits ranging from 2 to 16. The experimental results demonstrate that the variance of quantum circuits with a unitary 2-design structure decreases exponentially as the number of qubits increases. The slope of the curve is approximately -0.58, as shown in Fig. 4. This trend suggests that as the number of qubits increases, the gradient of the entire quantum circuit is more likely to vanish, resulting in a significant decline in performance. The variance of quantum circuits with a unitary 1-design structure also decreases exponentially with the number of qubits. However, the slope of this decrease is only about half that of the unitary 2-design structure, which is consistent with Eq. (10).

Figure 4: The variances in gradients were compared with different numbers of qubits under unitary t-designs of different orders. The blue and orange points represent the variances for t=2 and t=1, respectively, across a range of qubit numbers. The variances are approximated by polynomial functions represented by the green and red lines, and their slopes are also given.

Subsequently, we evaluate the gradient with respect to $\theta_{1,1}$, the first parameter in the quantum circuit in Fig. 3, across varying configurations of qubits and layers. Initially, the structure is a unitary 2-design. It transforms into a unitary 1-design upon entanglement with auxiliary qubits. After setting the pending layers to 20 and removing these auxiliary qubits, the structure becomes the structure proposed in this study. Our experimental assessment encompasses a range of qubits from 2 to 16 and layers from 5 to 500, providing a comprehensive overview of the impact of these variables on the gradient of the quantum circuit.

Consistent with the findings in Fig. 4, the influence of the number of qubits on the variance of a single parameter shows a similar pattern. The variance of quantum circuits with a unitary 2-design structure exhibits an exponential decline as the number of qubits increases, with a slope of approximately -0.69, as illustrated in Fig. 5. This trend highlights that as the number of qubits increases, the gradient of the quantum circuit becomes more likely to vanish at the level of a single variable. This indicates that the barren plateaus phenomenon affects both the global landscape and local variational parameters, resulting in a significant performance deterioration. Conversely, quantum circuits with a unitary 1-design structure display a variance that shows no significant correlation with the number of qubits, with the slope nearing zero. The proposed method inherits the characteristics of the unitary 1-design and maintains consistent variance as the number of qubits increases.

Figure 5: The variances in gradients were compared with different numbers of qubits under different structures. The blue and orange points represent the variances for unitary 2-design and 1-design, respectively, across a range of qubit numbers. The variances are approximated by polynomial functions represented by the green and red lines. The slopes of these lines are also given. The variances associated with the proposed structure are illustrated via purple lines.

Regarding the impact of the number of layers on variance, Fig. 6 shows that the variance decreases as the number of layers increases for quantum circuits with a unitary 2-design structure. The decrease becomes more pronounced as the number of qubits increases. For quantum circuits with a unitary 1-design structure, the variance fluctuates within a smaller range and is not significantly affected by the number of layers. However, with fewer qubits or shallower circuits, the unitary 2-design structure tends to have an advantage. To exploit both advantages, the proposed structure follows a unitary 2-design when the number of pending layers is small. As the layers deepen, the structure transitions to a unitary 1-design and is more likely to achieve an effective gradient.

Figure 6: The variance of the gradient is compared for different numbers of layers under unitary t-designs of varying orders and the proposed structure. The variances for t=2 and t=1 designs are depicted by the green and red lines, respectively, while the purple lines represent the proposed structure. Shading from dark to light corresponds to quantum circuits with even numbers of qubits from 2 to 16; the top line represents 2 qubits.

We then investigate the distribution of the gradient $\partial_{\theta_{1,1}}E$ within a quantum circuit composed of 10 qubits and 500 layers in Fig. 7. The distribution profiles for the proposed structure and a unitary 2-design are assessed for their respective potential in gradient optimization. The unitary 2-design structure exhibits a steep, narrow distribution of gradient values, concentrated near zero and conforming to a normal distribution $\mathcal{N}(0,2.738\times 10^{-4})$, where $\mathcal{N}$ denotes the normal distribution with a mean of 0 and a variance of $2.738\times 10^{-4}$. This highlights the barren plateaus phenomenon, where optimization becomes challenging due to vanishing gradients. In contrast, the proposed structure demonstrates a substantially wider distribution of gradient values, conforming to a normal distribution $\mathcal{N}(0,1.655\times 10^{-2})$, indicating a decreased propensity for gradients to concentrate at zero. This increases the likelihood of locating non-trivial gradient values conducive to effective optimization. The proposed architecture outperforms the other structure by avoiding barren plateaus, facilitating more robust quantum circuit training and optimization.
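
The practical meaning of the two fitted variances can be illustrated by asking how often a gradient drawn from each normal distribution exceeds a given magnitude. The sketch below uses the variances quoted above. The cutoff of 0.05 is a hypothetical threshold chosen for illustration and does not come from the experiments.

```python
import math

def prob_exceeds(threshold, variance):
    """P(|g| > threshold) for g ~ N(0, variance)."""
    sigma = math.sqrt(variance)
    return math.erfc(threshold / (sigma * math.sqrt(2)))

VAR_2_DESIGN = 2.738e-4  # fitted variance quoted in the text
VAR_PROPOSED = 1.655e-2  # fitted variance quoted in the text
THRESHOLD = 0.05         # hypothetical "useful gradient" cutoff, not from the paper

print("variance ratio:", VAR_PROPOSED / VAR_2_DESIGN)
print("2-design  P(|g| > 0.05):", prob_exceeds(THRESHOLD, VAR_2_DESIGN))
print("proposed  P(|g| > 0.05):", prob_exceeds(THRESHOLD, VAR_PROPOSED))
```

For this threshold the proposed structure produces a gradient above the cutoff in a majority of draws, while the unitary 2-design does so in well under one percent of draws.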

Figure 7: Distribution of the gradient $\partial_{\theta_{1,1}}E$ in a 10-qubit, 500-layer quantum circuit. The orange and blue histograms depict the frequency of gradient values for the proposed and the unitary 2-design structure, respectively. The red and green curves represent Gaussian kernel-density estimates fitted to the histograms, capturing the distribution trends for each structure.

Finally, we conduct four separate experiments in Fig. 8 to evaluate the effectiveness of the unitary 2-design and the proposed structure in achieving fixed target values of -0.1, -0.05, 0.05, and 0.1 for the expectation value. Each experiment is configured with a 10-qubit, 100-layer structure. The proposed structure uses a pending layer setting of 1 and is trained for 10 epochs per pending structure, totaling 1000 epochs. The cost function is $(\mathrm{expectation}-\mathrm{target})^{2}$ and the optimizer is pennylane.AdamOptimizer() [28].
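
The shape of this optimization problem can be sketched with a toy stand-in. Below, the circuit expectation is replaced by the hypothetical single-qubit function $\cos 2\theta$, and the optimizer is a hand-written Adam update rather than PennyLane's implementation. The starting point, step size, and other hyperparameters are illustrative assumptions and are not those of the experiments.

```python
import math

def expectation(theta):
    """Toy stand-in for the circuit expectation: <Z> after exp(-i theta Y) on |0>."""
    return math.cos(2 * theta)

def cost_and_grad(theta, target):
    """Cost (expectation - target)^2 and its derivative with respect to theta."""
    diff = expectation(theta) - target
    return diff ** 2, 2 * diff * (-2 * math.sin(2 * theta))

def adam_minimize(theta, target, steps=1000, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """Plain Adam update on a single parameter (illustrative hyperparameters)."""
    m = v = 0.0
    for t in range(1, steps + 1):
        _, g = cost_and_grad(theta, target)
        m = b1 * m + (1 - b1) * g          # first-moment estimate
        v = b2 * v + (1 - b2) * g * g      # second-moment estimate
        m_hat = m / (1 - b1 ** t)          # bias corrections
        v_hat = v / (1 - b2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

for target in (-0.1, -0.05, 0.05, 0.1):
    theta = adam_minimize(0.3, target)
    print(target, expectation(theta), cost_and_grad(theta, target)[0])
```

In this one-parameter toy the gradient never vanishes over a whole region, so the optimizer reaches each target easily. The experiments above probe exactly the regime where that is no longer true for the unitary 2-design.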

Over the 1000 observed epochs for each experiment in Fig. 8, the proposed structure’s expectations tend to cluster closely around the target line, unlike those of the unitary 2-design, whose data points are notably more dispersed. The proposed structure’s convergence towards the target value indicates superior performance, highlighting its effectiveness over the unitary 2-design in achieving the desired outcomes.

Calculating the probability distribution of expectation values across epochs 300 to 1000 allows us to determine the variability of the predicted value. A narrower peak indicates a more consistent approach to the target, while a wider distribution indicates a greater variance. The central peak of each distribution corresponds to the most frequently occurring expectation value within this range of epochs. Ideally, these peaks should coincide with the targets. Compared to the unitary 2-design structure, the proposed structure’s distribution around the target value is sharper and has larger peaks. This indicates that the proposed structure stays closer to the target value.

The cost function is a measure used to evaluate the model’s performance, with lower values indicating better performance. The proposed structure is more likely to obtain effective gradients during training, allowing its cost function to remain below $10^{-4}$. In contrast, the unitary 2-design structure cannot achieve this, which prevents its cost function from converging to a smaller value.

Figure 8: Comparative analysis of 10-qubit, 100-layer circuits across four experiments, targeting values of -0.1, -0.05, 0.05, and 0.1. Each row represents one experiment, showing epoch-wise expectation results (left), the frequency distribution of the expectation (middle), and the cost function value per epoch (right). In Epoch-Expectation, the horizontal grey line denotes the target, with red points for the proposed method’s outputs and green points for the unitary 2-design structure’s outputs. In Expectation-Frequency, blue and yellow bars depict the occurrence frequency of the expectations of the proposed method and the unitary 2-design structure, respectively, with green and red curves representing the fitted Gaussian kernel-density estimates. In Epoch-Cost Function, red points show the proposed method’s cost values, while green points show the unitary 2-design structure’s cost values.

V Conclusion

This paper addressed a central issue in quantum computing: the barren plateaus phenomenon. This phenomenon presents a significant challenge in quantum machine learning and optimization algorithms. Large-scale quantum circuits are characterized by a vanishing gradient variance, which must be addressed. Our approach transforms the quantum circuit from a unitary 2-design to a unitary 1-design without changing the original structure, marking a significant stride in gradient optimization. We introduced auxiliary control qubits that can be eliminated to achieve this transformation.

Our experiments systematically demonstrated the advantages of the proposed structure over global unitary 2-design quantum circuits. The entanglement and elimination of auxiliary qubits facilitate the gradual transition from a unitary 2-design to a unitary 1-design. This approach effectively mitigates the challenges posed by barren plateaus. The experimental data consistently indicated that the proposed structure maintains a stable gradient variance, avoiding the exponential decline associated with an increasing number of qubits, which is a notable issue in global unitary 2-design structures. Moreover, the proposed structure shows a diminished sensitivity to the number of layers, maintaining effective gradients more reliably than the unitary 2-design, particularly in circuits with more layers.

The analysis of gradient distributions emphasizes the superiority of the proposed structure. The unitary 2-design is prone to narrow gradient distributions, which concentrate near zero and indicate barren plateaus. In contrast, the proposed structure exhibits a broader gradient distribution, increasing the probability of achieving non-trivial gradients and successful optimization.

In achieving fixed target values for the expectation and cost function, the proposed structure demonstrates a pronounced ability to align closely with target values. It outperforms the unitary 2-design in both consistency and cost function minimization. This is evidenced by the sharper and taller peaks in the probability distribution of expectation values and the ability to keep the cost function consistently low throughout the training process.

Our method has demonstrated significant efficacy in random parameterized quantum circuits (RPQCs). We have experimentally validated our approach to RPQCs, demonstrating its theoretical feasibility and practical applicability. Importantly, our method achieves these benefits without introducing additional variables or significantly altering the existing quantum circuit structure. This substantially reduces implementation complexity and cost in practical applications.

The methodology and results of this study may pave new pathways in the field of quantum computing. Our study will be extended to other types of quantum circuits, such as quantum neural networks (QNNs). Applying the same method in these circuits is expected to yield positive results in gradient optimization. This strategy effectively addresses the barren plateaus problem and opens up new opportunities for designing and optimizing future quantum algorithms.

In summary, this study proposes an innovative and practical method that effectively solves an essential issue in quantum computing. Our study aims to enhance the comprehension and optimization of gradient behaviors in large-scale quantum circuits. It provides valuable guidance for future gradient optimization strategies in quantum machine learning and other advanced quantum computing applications.

The code used for this work is released in [29].

References

Appendix A Haar Measure and Weingarten Function

The unitary representation is introduced in this section. Ref. [27] presents the following theorem. Let $U(N)$ denote the unitary group of degree $N$. The irreducible representations of $U(N)$ which occur in $(\mathbb{C}^{N})^{\otimes r}\otimes(\mathbb{C}^{N*})^{\otimes s}$ are precisely those indexed by non-increasing, length-$N$ integer sequences $\mu=(\mu_{1},\mu_{2},\dots,\mu_{N})$, under the conditions:

  1. The sum of the elements: $\absolutevalue{\mu}=r-s$

  2. The sum of the positive elements: $\absolutevalue{\mu_{+}}\leq r$

Furthermore, the dimension of each irreducible representation indexed by such a sequence $\mu$ is given by:

$$
d_{\mu}=\prod_{1\leq i<j\leq N}\frac{\mu_{i}-\mu_{j}+j-i}{j-i}. \tag{16}
$$

For positive integers $N$, $r$, and $s$, the total dimension, denoted as $D(N,r,s)$, contributed by these representations to the tensor product space is the sum of the squared dimensions of all irreducible representations that satisfy the conditions above:

$$
D(N,r,s):=\sum_{\begin{subarray}{c}\absolutevalue{\mu}=r-s\\ \absolutevalue{\mu_{+}}\leq r\end{subarray}}d_{\mu}^{2}. \tag{17}
$$
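Equations (16) and (17) can be evaluated directly for small cases. The following is a minimal sketch, assuming the product in Equation (16) runs over pairs $i<j$ and that $\absolutevalue{\mu}$ and $\absolutevalue{\mu_{+}}$ denote sums of entries; the function names are illustrative and do not come from the released code.

```python
from fractions import Fraction
from itertools import product


def weyl_dimension(mu):
    """Dimension d_mu of the U(N) irrep indexed by mu (Eq. 16), with N = len(mu)."""
    n = len(mu)
    dim = Fraction(1)
    for i in range(n):
        for j in range(i + 1, n):  # product over pairs i < j
            dim *= Fraction(mu[i] - mu[j] + j - i, j - i)
    return int(dim)  # the product is always an integer


def admissible_weights(n, r, s):
    """Non-increasing length-n integer sequences with sum r - s and positive-part sum <= r."""
    # Both conditions together force every entry to lie between -s and r.
    for mu in product(range(r, -s - 1, -1), repeat=n):
        non_increasing = all(mu[k] >= mu[k + 1] for k in range(n - 1))
        positive_sum = sum(m for m in mu if m > 0)
        if non_increasing and sum(mu) == r - s and positive_sum <= r:
            yield mu


def total_dimension(n, r, s):
    """D(N, r, s) from Eq. (17): sum of squared dimensions over admissible weights."""
    return sum(weyl_dimension(mu) ** 2 for mu in admissible_weights(n, r, s))
```

For instance, `weyl_dimension((1, 0, -1))` gives 8, the dimension of the adjoint representation for $N=3$, and `total_dimension(2, 1, 0)` gives 4. The brute-force enumeration is only practical for small $N$, $r$, and $s$.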

Next, consider a specific Haar integral [31]. Let $N$ be a positive integer and let $\vec{i}=(i_{1},\dots,i_{p})$, $\vec{j}=(j_{1},\dots,j_{p})$, $\vec{i}^{\prime}=(i^{\prime}_{1},\dots,i^{\prime}_{q})$, and $\vec{j}^{\prime}=(j^{\prime}_{1},\dots,j^{\prime}_{q})$ be tuples of positive integers from $(1,2,\dots,N)$. Then,

$$
I_{N,p,q}=\int_{U(N)}d\mu\cdot U_{i_{1}j_{1}}U_{i_{2}j_{2}}\dots U_{i_{p}j_{p}}U_{i^{\prime}_{1}j^{\prime}_{1}}^{\ast}\dots U_{i^{\prime}_{q}j^{\prime}_{q}}^{\ast}=\begin{cases}0,&\text{if }p\neq q\\ \sum_{\sigma,\tau\in S_{p}}\delta_{(\sigma,\tau)}Wg(N,\sigma\tau^{-1}),&\text{if }p=q\end{cases} \tag{18}
$$

In this equation, $\delta_{(\sigma,\tau)}=\delta_{i_{1}i^{\prime}_{\sigma(1)}}\cdots\delta_{i_{q}i^{\prime}_{\sigma(q)}}\delta_{j_{1}j^{\prime}_{\tau(1)}}\cdots\delta_{j_{q}j^{\prime}_{\tau(q)}}$ and $Wg$ is the Weingarten function [30], given by

$$
Wg(N,\sigma)=\frac{1}{q!^{2}}\sum_{\lambda}\frac{\chi^{\lambda}(1)^{2}\chi^{\lambda}(\sigma)}{s_{\lambda,N}(1)}, \tag{19}
$$

where the sum runs over all partitions $\lambda$ of $q$. The character corresponding to the partition $\lambda$ is denoted by $\chi^{\lambda}$, and $s$ is the Schur polynomial of $\lambda$. Therefore, $s_{\lambda,N}(1)$ represents the dimension of the representation of $U(N)$ corresponding to $\lambda$.
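As a small sanity check, Equation (19) can be evaluated exactly for $q=1$ and $q=2$, where the character tables of $S_{1}$ and $S_{2}$ are known by hand. The sketch below is illustrative only: the hard-coded character table, the labels `"e"` and `"swap"`, and the function names are assumptions introduced here, and the dimension $s_{\lambda,N}(1)$ is computed with the same product formula as Equation (16) (taken over $i<j$).

```python
from fractions import Fraction
from math import factorial

# Character tables keyed by partition; "e" is the identity, "swap" the transposition.
# S_2 has the trivial irrep (2,) and the sign irrep (1, 1).
CHARACTERS = {
    1: {(1,): {"e": 1}},
    2: {(2,): {"e": 1, "swap": 1}, (1, 1): {"e": 1, "swap": -1}},
}


def schur_dimension(partition, n):
    """s_{lambda,N}(1): dimension of the U(N) irrep labelled by the partition (needs n >= len)."""
    mu = list(partition) + [0] * (n - len(partition))
    dim = Fraction(1)
    for i in range(n):
        for j in range(i + 1, n):
            dim *= Fraction(mu[i] - mu[j] + j - i, j - i)
    return dim


def weingarten(n, q, cycle_type):
    """Evaluate Eq. (19) for q in {1, 2}, assuming n >= q so no dimension vanishes."""
    total = Fraction(0)
    for partition, chi in CHARACTERS[q].items():
        total += Fraction(chi["e"] ** 2 * chi[cycle_type]) / schur_dimension(partition, n)
    return total / factorial(q) ** 2
```

With exact rational arithmetic, `weingarten(n, 1, "e")` equals $1/N$, `weingarten(n, 2, "e")` equals $1/(N^{2}-1)$, and `weingarten(n, 2, "swap")` equals $-1/(N^{3}-N)$ for $N\geq 2$.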

Then, we require certain conclusions about the Haar measure, which can be proved from Equation (18). First, when $U$ is a unitary 1-design, the following equation holds because $Wg(N,(1))=\frac{1}{N}$:

$$
\Tr{\int d\mu\cdot U^{\dagger}AUB}=\frac{\Tr{A}\Tr{B}}{N}. \tag{20}
$$
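Equation (20) can be checked exactly on a single qubit ($N=2$), because the four Pauli matrices with uniform weights form a unitary 1-design, so the Haar integral can be replaced by a finite average. This is a minimal sketch using only nested lists; the test matrices `A` and `B` are arbitrary values chosen for illustration.

```python
# Single-qubit Pauli matrices as nested lists of (complex) numbers.
I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
Y = [[0, -1j], [1j, 0]]
Z = [[1, 0], [0, -1]]
PAULIS = [I2, X, Y, Z]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def dagger(a):
    return [[complex(a[j][i]).conjugate() for j in range(2)] for i in range(2)]


def trace(a):
    return a[0][0] + a[1][1]


def twirl_trace(a, b, ensemble):
    """Average of Tr(U^dagger A U B) over a finite ensemble of unitaries."""
    total = 0
    for u in ensemble:
        total += trace(matmul(matmul(matmul(dagger(u), a), u), b))
    return total / len(ensemble)


# Arbitrary (non-Hermitian) test matrices; Eq. (20) is linear in A and B.
A = [[0.3, 1 - 2j], [0.5j, -1.1]]
B = [[2, 0.7], [-1j, 0.4 + 0.2j]]

lhs = twirl_trace(A, B, PAULIS)
rhs = trace(A) * trace(B) / 2  # right-hand side of Eq. (20) with N = 2
```

Here `lhs` and `rhs` agree up to floating-point rounding, whereas averaging over the identity alone, `twirl_trace(A, B, [I2])`, simply returns $\Tr(AB)$ and does not reproduce the 1-design value.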

Similarly, Equation (21) is valid when $U$ represents a unitary 2-design, with $Wg(N,(1,1))=\frac{1}{N^{2}-1}$ and $Wg(N,(2))=-\frac{1}{N^{3}-N}$.

$$
\begin{aligned}
&\Tr{\int d\mu\cdot U^{\dagger}AUBU^{\dagger}CUD}\\
={}&\frac{\Tr{A}\Tr{C}\Tr{BD}+\Tr{AC}\Tr{B}\Tr{D}}{N^{2}-1}-\frac{\Tr{A}\Tr{B}\Tr{C}\Tr{D}+\Tr{AC}\Tr{BD}}{N^{3}-N}.
\end{aligned} \tag{21}
$$
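Equation (21) can likewise be checked exactly for $N=2$ by averaging over the single-qubit Clifford group, which is known to form a unitary 2-design. The sketch below generates the 24 Clifford elements (up to global phase) by closing the Hadamard and phase gates under multiplication; the helper names and the test matrices are assumptions made for illustration.

```python
import math

SQRT_HALF = 1 / math.sqrt(2)
HADAMARD = [[SQRT_HALF, SQRT_HALF], [SQRT_HALF, -SQRT_HALF]]
PHASE = [[1, 0], [0, 1j]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def dagger(a):
    return [[complex(a[j][i]).conjugate() for j in range(2)] for i in range(2)]


def trace(a):
    return a[0][0] + a[1][1]


def canonical_key(u):
    """Strip the global phase and round, so matrices equal up to phase share a key."""
    flat = [complex(x) for row in u for x in row]
    pivot = next(x for x in flat if abs(x) > 1e-9)
    phase = pivot / abs(pivot)
    return tuple((round((x / phase).real, 6), round((x / phase).imag, 6)) for x in flat)


def clifford_group():
    """Close {H, S} under multiplication, identifying matrices that differ by a phase."""
    identity = [[1, 0], [0, 1]]
    seen = {canonical_key(identity): identity}
    frontier = [identity]
    while frontier:
        u = frontier.pop()
        for g in (HADAMARD, PHASE):
            v = matmul(g, u)
            key = canonical_key(v)
            if key not in seen:
                seen[key] = v
                frontier.append(v)
    return list(seen.values())


def second_moment(a, b, c, d, ensemble):
    """Average of Tr(U^dagger A U B U^dagger C U D) over the ensemble."""
    total = 0
    for u in ensemble:
        ua = matmul(matmul(dagger(u), a), u)
        uc = matmul(matmul(dagger(u), c), u)
        total += trace(matmul(matmul(matmul(ua, b), uc), d))
    return total / len(ensemble)


def haar_prediction(a, b, c, d, n=2):
    """Right-hand side of Eq. (21)."""
    ac, bd = matmul(a, c), matmul(b, d)
    first = trace(a) * trace(c) * trace(bd) + trace(ac) * trace(b) * trace(d)
    second = trace(a) * trace(b) * trace(c) * trace(d) + trace(ac) * trace(bd)
    return first / (n**2 - 1) - second / (n**3 - n)


CLIFFORDS = clifford_group()
A = [[0.3, 1 - 2j], [0.5j, -1.1]]
B = [[2, 0.7], [-1j, 0.4 + 0.2j]]
C = [[1j, 0.2], [0.9, -0.5 + 1j]]
D = [[-0.4, 1.5j], [0.6, 0.1]]
lhs = second_moment(A, B, C, D, CLIFFORDS)
rhs = haar_prediction(A, B, C, D)
```

Global phases cancel in $U^{\dagger}AUBU^{\dagger}CUD$, so working with one representative per phase class is sufficient. The finite average `lhs` matches `rhs` up to rounding.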

Appendix B Detail of Entanglement with Auxiliary Qubits

To ensure clarity and focus, we present the circuit structure of an $n$-qubit RPQC entangled with $\lceil\log_{2}n\rceil$ auxiliary qubits as a single-layer configuration for a parameterized layer, shown in Fig. 9.

Before measurement, the unitary operator of this circuit is

$$
U=\prod_{i=1}^{n}[\rho_{\bar{i}}\otimes I+\rho_{i}\otimes U_{i}]_{i}\otimes I_{\bar{i}}^{\otimes n-1}\cdot\left[\bigotimes_{i=1}^{\lceil\log_{2}n\rceil}R_{y}(\phi_{i})\otimes I^{\otimes n}\right], \tag{22}
$$

where $I_{\bar{i}}^{\otimes n-1}$ is the identity matrix on all qubits except qubit $i$, $\rho_{i}=\ket{i}\bra{i}$, and $\rho_{\bar{i}}=I^{\otimes\lceil\log_{2}n\rceil}-\rho_{i}$. After measurement, the operator of the parameterized layer is obtained by taking the partial trace over the auxiliary qubits, which gives

$$
\begin{aligned}
\Tr_{a}\{U\}&=\prod_{i=1}^{n}[\Tr{R_{y}(\phi_{i})\rho_{\bar{i}}}\otimes I+\Tr{R_{y}(\phi_{i})\rho_{i}}\otimes U_{i}]_{i}\otimes I_{\bar{i}}^{\otimes n-1}\\
&=\prod_{i=1}^{n}[\alpha_{i}I+\beta_{i}U_{i}]_{i}\otimes I_{\bar{i}}^{\otimes n-1}\\
&=\bigotimes_{i=1}^{n}\left(\alpha_{i}I+\beta_{i}U_{i}\right).
\end{aligned} \tag{23}
$$
\Qcircuit @C=1em @R=.7em {
\lstick{\ket{0}} & \gate{R_{y}(\phi_{1})} & \ctrl{1} & \ctrl{2} & \ctrlo{3} & \ctrlo{4} & \meter \\
\lstick{\ket{0}} & \gate{R_{y}(\phi_{2})} & \ctrl{1} & \ctrlo{2} & \ctrl{3} & \ctrlo{4} & \meter \\
\lstick{\ket{\psi_{0}}} & \qw & \gate{U_{0}} & \qw & \qw & \qw & \qw \\
\lstick{\ket{\psi_{1}}} & \qw & \qw & \gate{U_{1}} & \qw & \qw & \qw \\
\lstick{\ket{\psi_{2}}} & \qw & \qw & \qw & \gate{U_{2}} & \qw & \qw \\
\lstick{\ket{\psi_{3}}} & \qw & \qw & \qw & \qw & \gate{U_{3}} & \qw \\
}
Figure 9: This simplified figure illustrates how to entangle an $n$-qubit single-layer configuration with $\lceil\log_{2}n\rceil$ auxiliary qubits for a parameterized layer. The unitary gate $U_{i}$ represents the rotation gate in RPQCs.

Next, we prove the conclusions about $E^{\prime}$. A parameterized quantum circuit can be characterized by a sequential application of unitary operations, so $U(\boldsymbol{\theta})$ is defined as

$$
U(\boldsymbol{\theta})=\prod_{l=1}^{L}U_{l}(\boldsymbol{\theta}_{l})W_{l}=\prod_{l=1}^{L}\prod_{i=1}^{n}e^{-i\theta_{i,l}V_{i}}W_{l}, \tag{24}
$$

where $U_{l}(\boldsymbol{\theta}_{l})$ and $W_{l}$ are unitary operators. Here, $V_{i}$ is a Pauli operator, and $W_{l}$ is a fixed unitary operator that does not depend on the angle $\theta_{i,l}$. Differentiating $U$ with respect to the $k$-th parameter gives

$$
\partial_{k}U(\boldsymbol{\theta})=\prod_{l=k}^{L}U_{l}(\boldsymbol{\theta}_{l})W_{l}(-iV_{k})\prod_{l=1}^{k-1}U_{l}(\boldsymbol{\theta}_{l})W_{l}=U_{+}(-iV_{k})U_{-}=-iR. \tag{25}
$$

Then, we utilize the structure with auxiliary qubits in Fig. 9, which enables us to derive the objective function as follows:

$$
\begin{aligned}
E^{\prime}(\boldsymbol{\theta})&=\bra{0}(\alpha I+\beta U(\boldsymbol{\theta}))^{\dagger}H(\alpha I+\beta U(\boldsymbol{\theta}))\ket{0}\\
&=\alpha^{2}\bra{0}H\ket{0}+\alpha\beta[\bra{0}HU\ket{0}+\bra{0}U^{\dagger}H\ket{0}]+\beta^{2}E.
\end{aligned} \tag{26}
$$

Next, we compute the gradient of $E^{\prime}(\boldsymbol{\theta})$. The first term does not depend on $\boldsymbol{\theta}$, so its gradient is 0. The remaining terms can be calculated using Equation (25):

$$
\begin{aligned}
\partial_{k}E^{\prime}&=\alpha\beta(-i\bra{0}HR\ket{0}+i\bra{0}R^{\dagger}H\ket{0})+\beta^{2}(\partial_{k}E)\\
&=i\alpha\beta\bra{0}[R,H]\ket{0}+\beta^{2}(\partial_{k}E).
\end{aligned} \tag{27}
$$
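The first line of Equation (27) can be compared against a finite-difference derivative of Equation (26) in a toy setting. The sketch below assumes a single qubit with one gate $U(\theta)=e^{-i\theta Y}$, observable $H=Z$, and $\alpha=\beta=\frac{1}{\sqrt{2}}$; these choices, and all function names, are illustrative assumptions rather than the configuration used in the experiments.

```python
import math

I2 = [[1, 0], [0, 1]]
Y = [[0, -1j], [1j, 0]]  # generator V_k (illustrative choice)
Z = [[1, 0], [0, -1]]    # observable H (illustrative choice)


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def dagger(a):
    return [[complex(a[j][i]).conjugate() for j in range(2)] for i in range(2)]


def corner(m):
    """<0| M |0> for the computational basis state |0>."""
    return complex(m[0][0])


def rotation(theta):
    """U(theta) = exp(-i theta Y) = cos(theta) I - i sin(theta) Y."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c * I2[i][j] - 1j * s * Y[i][j] for j in range(2)] for i in range(2)]


def objective(theta, h, alpha, beta):
    """E'(theta) from Eq. (26)."""
    u = rotation(theta)
    m = [[alpha * I2[i][j] + beta * u[i][j] for j in range(2)] for i in range(2)]
    return corner(matmul(matmul(dagger(m), h), m)).real


def gradient_eq27(theta, h, alpha, beta):
    """First line of Eq. (27), with R = U_+ V_k U_- = U Y for a single gate."""
    u = rotation(theta)
    r = matmul(u, Y)
    hr = corner(matmul(h, r))
    rh = corner(matmul(dagger(r), h))
    # dE/dtheta for E = <0|U^dagger H U|0>, using dU/dtheta = -i R from Eq. (25).
    d_energy = 1j * corner(matmul(matmul(dagger(r), h), u))
    d_energy -= 1j * corner(matmul(matmul(dagger(u), h), r))
    return (alpha * beta * (-1j * hr + 1j * rh) + beta**2 * d_energy).real


def finite_difference(theta, h, alpha, beta, eps=1e-6):
    """Central difference of E' with respect to theta."""
    upper = objective(theta + eps, h, alpha, beta)
    lower = objective(theta - eps, h, alpha, beta)
    return (upper - lower) / (2 * eps)


THETA = 0.7
ALPHA = BETA = 1 / math.sqrt(2)
analytic = gradient_eq27(THETA, Z, ALPHA, BETA)
numeric = finite_difference(THETA, Z, ALPHA, BETA)
```

In this toy case $E^{\prime}=\alpha^{2}+2\alpha\beta\cos\theta+\beta^{2}\cos 2\theta$, so both `analytic` and `numeric` should be close to $-\sin\theta-\sin 2\theta$.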

Given that $p\neq q$ in Equation (18) for the first term, and considering the condition $\langle\partial_{k}E\rangle=0$, the expectation of the gradient is zero.

Next, we calculate the variance. Because the expectation of the gradient vanishes, we have

$$
\mathrm{Var}[\partial_{k}E^{\prime}]=\langle(\partial_{k}E^{\prime})^{2}\rangle-\langle\partial_{k}E^{\prime}\rangle^{2}=\langle(\partial_{k}E^{\prime})^{2}\rangle. \tag{28}
$$

So we have to find the second-order moment of $\partial_{k}E^{\prime}$. Squaring Equation (27) gives

$$
(\partial_{k}E^{\prime})^{2}=-\alpha^{2}\beta^{2}\bra{0}[R,H]\ket{0}\bra{0}[R,H]\ket{0}+2i\alpha\beta^{3}\bra{0}[R,H]\ket{0}(\partial_{k}E)+\beta^{4}(\partial_{k}E)^{2}. \tag{29}
$$

Because of Equation (18), the Haar integral of the second term is zero, and the third term is the result of Equation (II). So we only need to calculate the Haar integral of the first term. Using the notation $\rho=\ket{\mathbf{init}}\bra{\mathbf{init}}$, the first term becomes

$$
\begin{aligned}
&\int d\mu\cdot\bra{\mathbf{init}}[R,H]\ket{\mathbf{init}}\bra{\mathbf{init}}[R,H]\ket{\mathbf{init}}\\
={}&\int d\mu\cdot\Tr{\rho(R^{\dagger}H-HR)\rho(R^{\dagger}H-HR)}\\
={}&\Tr{\rho\int d\mu\cdot(R^{\dagger}H\rho R^{\dagger}H+HR\rho HR)}-2\Tr{\rho\int d\mu\cdot HR\rho R^{\dagger}H}.
\end{aligned} \tag{30}
$$

In this equation, the first and second terms are zero because of Equation (18). So we only need to calculate the third Haar integral. Using the unitary 1-design result of Equation (20), the third term can be calculated as follows:

$$
\begin{aligned}
&\Tr{\rho\int d\mu\cdot HR\rho R^{\dagger}H}\\
={}&\Tr{\rho\int d\mu_{-}\int d\mu_{+}\cdot HU_{+}V_{k}U_{-}\rho(U_{+}V_{k}U_{-})^{\dagger}H}\\
={}&\Tr{H\rho H\int d\mu_{-}\cdot\left[\int d\mu_{+}\cdot U_{+}V_{k}U_{-}\rho U_{-}^{\dagger}V_{k}^{\dagger}U_{+}^{\dagger}\right]}\\
={}&\Tr{H\rho H\int d\mu_{-}\cdot\left[\frac{1}{N}\Tr{V_{k}U_{-}\rho U_{-}^{\dagger}V_{k}^{\dagger}}\right]}\\
={}&\Tr{H\rho H\cdot\left[\frac{1}{N}\Tr{\rho}\right]}\\
={}&\frac{1}{N}\Tr{H^{2}\rho}\Tr{\rho}.
\end{aligned} \tag{31}
$$

Upon consolidating all the above equations, we arrive at the final result

$$
\mathrm{Var}[\partial_{k}E^{\prime}]=\frac{2\alpha^{2}\beta^{2}}{N}\Tr{H^{2}\rho}\Tr{\rho}+\beta^{4}\langle(\partial_{k}E)^{2}\rangle. \tag{32}
$$

Appendix C Detail of Eliminate Auxiliary Qubits

In the parameterized layers, we use the same assumptions as before, except that here we set $\alpha=\beta=\frac{1}{\sqrt{2}}$. For most of the training process, the number of pending layers is relatively small compared to the number of adjustable and fixed layers, and this part contains parameters, so we integrate it into the adjustable layers for computational convenience. In the fixed layers, $\boldsymbol{\varphi}$ represents a set of fixed parameters that are exempt from training. Under this assumption, the operator is $\frac{1}{\sqrt{2}}U^{\prime}(\boldsymbol{\varphi})(I+U(\boldsymbol{\theta}))$. So, the expectation $E^{\prime\prime}$ is

$$
\begin{aligned}
E^{\prime\prime}(\boldsymbol{\theta})&=\frac{1}{2}\langle{\mathbf{init}}|(I+U(\boldsymbol{\theta}))^{\dagger}U^{\prime}(\varphi)^{\dagger}HU^{\prime}(\varphi)(I+U(\boldsymbol{\theta}))|\mathbf{init}\rangle\\
&=\Tr{\frac{\rho}{2}(I+U(\boldsymbol{\theta}))^{\dagger}H_{\varphi}(I+U(\boldsymbol{\theta}))}\\
&=\Tr{\frac{\rho}{2}(H_{\varphi}+U^{\dagger}H_{\varphi}+H_{\varphi}U+U^{\dagger}H_{\varphi}U)},
\end{aligned} \tag{33}
$$

where $H_{\varphi}=U^{\prime}(\boldsymbol{\varphi})^{\dagger}HU^{\prime}(\boldsymbol{\varphi})$.

The first term of $E^{\prime\prime}$ does not depend on $\boldsymbol{\theta}$, so its gradient is zero. Using Equation (25), the gradient of $E^{\prime\prime}$ is:

$$
\begin{aligned}
\partial_{k}E^{\prime\prime}&=\Tr{\frac{i\rho}{2}(R^{\dagger}H_{\varphi}-H_{\varphi}R+R^{\dagger}H_{\varphi}U-U^{\dagger}H_{\varphi}R)}\\
&=\frac{i}{2}\Tr{\rho[R,H_{\varphi}]+H_{\varphi,+}[\rho_{-},V_{k}]},
\end{aligned} \tag{34}
$$

where $\rho_{-}=U_{-}^{\dagger}\rho U_{-}$ and $H_{\varphi,+}=U_{+}^{\dagger}H_{\varphi}U_{+}$.

As the pending layer is located on the right side of the entire parameterized structure, only the unitary $t$-design property of $U_{+}$ needs to be considered.

When $U_{+}$ is at least a unitary 1-design, Equation (18) makes the first term vanish and Equation (20) applies to the second term. So, the first-order moment can be calculated as follows.

$$
\begin{aligned}
\langle\partial_{k}E^{\prime\prime}\rangle&=\int d\mu\cdot\partial_{k}E^{\prime\prime}\\
&=\frac{i}{2}\int d\mu_{-}\int d\mu_{+}\cdot\Tr{H_{\varphi,+}[\rho_{-},V_{k}]}\\
&=\frac{i}{2}\int d\mu_{-}\cdot\Tr{\int d\mu_{+}\cdot U_{+}^{\dagger}H_{\varphi}U_{+}[\rho_{-},V_{k}]}\\
&=\frac{i}{2N}\int d\mu_{-}\cdot\Tr{H_{\varphi}}\Tr{[\rho_{-},V_{k}]}=0.
\end{aligned} \tag{35}
$$

This result is obtained because the trace of the commutator is zero.

Before calculating the second-order moment, we first expand $(\partial_{k}E^{\prime\prime})^{2}$.

$$
\begin{aligned}
(\partial_{k}E^{\prime\prime})^{2}={}&-\frac{1}{4}\langle{\mathbf{init}}|R^{\dagger}H_{\varphi}-H_{\varphi}R|\mathbf{init}\rangle\langle{\mathbf{init}}|R^{\dagger}H_{\varphi}-H_{\varphi}R|\mathbf{init}\rangle\\
&-\frac{1}{2}\langle{\mathbf{init}}|R^{\dagger}H_{\varphi}-H_{\varphi}R|\mathbf{init}\rangle\langle{\mathbf{init}}|R^{\dagger}H_{\varphi}U-U^{\dagger}H_{\varphi}R|\mathbf{init}\rangle\\
&-\frac{1}{4}\langle{\mathbf{init}}|R^{\dagger}H_{\varphi}U-U^{\dagger}H_{\varphi}R|\mathbf{init}\rangle\langle{\mathbf{init}}|R^{\dagger}H_{\varphi}U-U^{\dagger}H_{\varphi}R|\mathbf{init}\rangle\\
={}&\frac{1}{2}\Tr{\rho H_{\varphi}R\rho R^{\dagger}H_{\varphi}}-\frac{1}{4}\Tr{H_{\varphi,+}[\rho_{-},V_{k}]H_{\varphi,+}[\rho_{-},V_{k}]}+\text{others}.
\end{aligned} \tag{36}
$$

In this expression, the first term requires a unitary 1-design and the second a unitary 2-design. The Haar integral of the remaining terms ("others") is zero by Equation (18). When $U_{+}$ is at least a unitary 1-design, Equation (20) applies to the first term.

$$
\begin{aligned}
&\int d\mu\cdot\frac{1}{2}\Tr{\rho H_{\varphi}R\rho R^{\dagger}H_{\varphi}}\\
={}&\frac{1}{2}\int d\mu_{-}\cdot\Tr{\int d\mu_{+}\cdot\rho H_{\varphi}R\rho R^{\dagger}H_{\varphi}}\\
={}&\frac{1}{2}\int d\mu_{-}\cdot\Tr{\int d\mu_{+}\cdot\rho H_{\varphi}(U_{+}V_{k}U_{-})\rho(U_{+}V_{k}U_{-})^{\dagger}H_{\varphi}}\\
={}&\frac{1}{2N}\int d\mu_{-}\cdot\Tr{H_{\varphi}\rho H_{\varphi}}\Tr{V_{k}U_{-}\rho U_{-}^{\dagger}V_{k}^{\dagger}}\\
={}&\frac{1}{2N}\Tr{\rho H_{\varphi}^{2}}\Tr{\rho}
\end{aligned} \tag{37}
$$

When $U_{+}$ is at least a unitary 2-design, Equation (21) applies to the second term.

$$
\begin{aligned}
&\int d\mu\cdot\left(-\frac{1}{4}\right)\Tr{H_{\varphi,+}[\rho_{-},V_{k}]H_{\varphi,+}[\rho_{-},V_{k}]}\\
={}&-\frac{1}{4}\int d\mu_{-}\cdot\Tr{\int d\mu_{+}\cdot H_{\varphi,+}[\rho_{-},V_{k}]H_{\varphi,+}[\rho_{-},V_{k}]}\\
={}&-\frac{1}{4}\int d\mu_{-}\cdot\Tr{\int d\mu_{+}\cdot U_{+}^{\dagger}H_{\varphi}U_{+}[\rho_{-},V_{k}]U_{+}^{\dagger}H_{\varphi}U_{+}[\rho_{-},V_{k}]}\\
={}&-\frac{1}{4(N^{2}-1)}\int d\mu_{-}\cdot\left[\Tr{H_{\varphi}}^{2}\Tr{[\rho_{-},V_{k}]^{2}}+\Tr{H_{\varphi}^{2}}\Tr{[\rho_{-},V_{k}]}^{2}\right]+O(N^{-2})\\
={}&-\frac{1}{4(N^{2}-1)}\Tr{H_{\varphi}}^{2}\Tr\langle[\rho_{-},V_{k}]^{2}\rangle_{U_{-}}+O(N^{-2})
\end{aligned} \tag{38}
$$

Finally, combining Equations (37) and (38) with Equation (28), we obtain

$$
\mathrm{Var}[\partial_{k}E^{\prime\prime}]=\begin{cases}\frac{1}{2N}\Tr{\rho H_{\varphi}^{2}}\Tr{\rho}+O(N^{-1}),&\text{if }t=1\\ \frac{1}{2N}\Tr{\rho H_{\varphi}^{2}}\Tr{\rho}-\frac{1}{4(N^{2}-1)}\Tr{H_{\varphi}}^{2}\Tr\langle[\rho_{-},V_{k}]^{2}\rangle_{U_{-}}+O(N^{-2}),&\text{if }t=2\end{cases} \tag{39}
$$

We can observe that the result obtained after eliminating the auxiliary qubits is almost identical to that of the entanglement method. This makes it easier to achieve a good gradient as the number of qubits increases.