
A Quantum Information Theoretic View On A Deep Quantum Neural Network

Beatrix C. Hiesmayr [email protected] University of Vienna, Faculty of Physics, Währingerstrasse 1, 1090 Vienna (Austria).
Abstract

We discuss a quantum version of an artificial deep neural network in which the role of neurons is taken over by qubits and the role of weights is played by unitaries. The role of the non-linear activation function is taken over by subsequently tracing out layers (qubits) of the network. We study two examples and discuss the learning from a quantum information theoretic point of view. In detail, we show that the lower bound of the Heisenberg uncertainty relation determines the gradient-descent update in the learning process. We raise the question of whether the limit set by Nature for two non-commuting observables, quantified in the Heisenberg uncertainty relation, rules the optimization of the quantum deep neural network. We find a negative answer.

I Introduction

Figure 1: This figure sketches how the input vector \vec{x}^{(1)} is processed by feed-forward propagation in an artificial deep neural network. The bias and weights are applied to a non-linear function f, which defines the input for the next layer, and so on.

Machine learning, whether supervised, unsupervised, reinforcement, or generative adversarial (GANs), has shown impressive successes in recent years (e.g. Ref. [1] for an application to photovoltaic systems, Ref. [2] for utilizing machine learning for the identification of particles, or Ref. [3] utilizing deep networks for automatic cleaning of data). Here we want to raise the question of whether we can do better with quantum systems. In general there are several approaches and claims, but no clear candidate. We focus on quantum artificial neural networks that utilize qubits as perceptrons. For a general overview of the current perspective on quantum algorithms on a quantum computer in the noisy intermediate-scale quantum (NISQ) era the reader may e.g. be referred to Ref. [4].

Classical neural networks started their success story once hidden layers were introduced in addition to a non-linear activation function. The working of classical neural networks is shown in Fig. 1. The weights w_{k,l} and biases b_{i} at the different layers are the parameters that have to be learnt from the training pairs provided in the learning process. In addition, an activation function f has to be chosen which, as it turned out, must be non-linear in order to guarantee the universal approximation theorem, i.e. that any function can be efficiently approximated by the network. The learning of the classical network proceeds by defining a cost function and utilizing backward propagation, which updates the weights and biases via gradient descent such that the cost function improves, i.e. the desired output is approximated better and better.

We focus on a quantum version of a classical neural network that replaces each perceptron with a qubit. The weights and possibly biases are realized by different unitary matrices. The main challenge is to introduce an activation function in the quantum version, since quantum theory is manifestly linear and the unitary evolution is reversible and non-dissipative. This is in strong contrast to classical neural networks, which have non-linear activation functions and dissipative dynamics at their heart. It is generally open which properties of classical artificial neural networks should be met to call something a meaningful quantum artificial neural network. But this question goes deeper, since it asks in general what the difference is between classical and quantum information and its processing.

In this paper we discuss these issues by considering a particular example of a deep quantum artificial neural network.

II A quantum artificial deep neural network

Figure 2: This figure sketches how a two-qubit input state is processed by feed-forward propagation in an artificial quantum neural network. Two unitaries U_1 and U_2 are applied, followed by a partial trace; the resulting two-qubit state is the input for the next layer, and so on.

A minimal deep quantum artificial neural network is sketched in Fig. 2. A unitary U_1 acts upon the two input qubits and the first qubit of the hidden layer; this is followed by a unitary U_2 that acts upon the two input qubits and the second qubit of the hidden layer. Obviously, the ordering of these two unitaries matters. Then a partial trace is applied to the input layer (the first two qubits), resulting in a two-qubit state for which the same process is repeated with two new unitaries U_3, U_4, followed by a partial trace over the hidden layer; the result is the two-qubit output state of the quantum neural network. The partial trace may be interpreted as the activation function, and the parameters of the four unitary operators U_1, U_2, U_3, U_4 as the weights or biases.

For the minimal network of a two-qubit input layer (1st layer), a hidden layer of two qubits (2nd layer), and an output layer of two qubits (3rd layer), the four unitaries each have dimension 8, which implies 8^2 = 64 free parameters each and 4\cdot 64 = 256 parameters in total. As a cost function we will use the fidelity, which is a measure of the “closeness” of two quantum states. It expresses the probability that one state will pass a test to identify it as the other. It is generally defined by

F(\rho,\sigma)\;=\;F(\sigma,\rho)\;=\;\Bigl(\mathrm{Tr}\sqrt{\sqrt{\rho}\,\sigma\sqrt{\rho}}\Bigr)^{2}\;, \qquad (1)

which reduces in the special case of pure states to the overlap of the two states. For qubits the fidelity also reduces to F(\rho,\sigma)=\mathrm{Tr}(\rho\sigma)+2\sqrt{\det(\rho)\det(\sigma)}. The fidelity takes values in [0,1], and in the case of 1 the states can be considered equivalent. The problems that we will consider have a desired output state \Phi_{desired}^{z} that is chosen to be pure. This simplifies the cost function to C=\langle\Phi_{desired}|\rho_{out}|\Phi_{desired}\rangle, which equals 1 if the output state \rho_{out} of the network perfectly overlaps with the desired state \Phi_{desired}, and otherwise lies in [0,1).
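Since the fidelity is the backbone of the cost function below, it may be helpful to see it computed explicitly. The following NumPy sketch (function names are our own illustration, not the paper's code) implements Eq. (1) via an eigendecomposition-based matrix square root and checks it against the qubit shortcut:

```python
import numpy as np

def sqrtm_psd(rho):
    """Matrix square root of a positive semidefinite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(rho)
    vals = np.clip(vals, 0.0, None)          # guard against tiny negative eigenvalues
    return vecs @ np.diag(np.sqrt(vals)) @ vecs.conj().T

def fidelity(rho, sigma):
    """F(rho, sigma) = (Tr sqrt(sqrt(rho) sigma sqrt(rho)))^2, Eq. (1)."""
    s = sqrtm_psd(rho)
    return np.real(np.trace(sqrtm_psd(s @ sigma @ s))) ** 2

def fidelity_qubit(rho, sigma):
    """Qubit shortcut: Tr(rho sigma) + 2 sqrt(det(rho) det(sigma))."""
    dets = np.real(np.linalg.det(rho) * np.linalg.det(sigma))
    return np.real(np.trace(rho @ sigma)) + 2 * np.sqrt(max(dets, 0.0))

# two example single-qubit density matrices
rho = np.array([[0.7, 0.2], [0.2, 0.3]], dtype=complex)
sigma = np.array([[0.5, 0.1j], [-0.1j, 0.5]], dtype=complex)

assert abs(fidelity(rho, rho) - 1.0) < 1e-9                      # F(rho, rho) = 1
assert abs(fidelity(rho, sigma) - fidelity_qubit(rho, sigma)) < 1e-8
```

For qubits, the general formula and the determinant shortcut agree, which is a useful consistency check before moving to the two-qubit cost function.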

II.1 Feed Forward Propagation

The two-qubit output state of the minimal network of “two-qubits–two-qubits–two-qubits” is given by

\rho_{out}^{z}\;=\;\sum_{i,j,k,l=0}^{1}\bigl(\langle ijkl|\otimes\mathbbm{1}_{4}\bigr)\,U_{4}\,U_{3}\,U_{2}\,U_{1}\,\bigl(\rho_{in}^{z}\otimes|00\rangle\langle 00|\otimes|00\rangle\langle 00|\bigr)\,U_{1}^{\dagger}\,U_{2}^{\dagger}\,U_{3}^{\dagger}\,U_{4}^{\dagger}\,\bigl(|ijkl\rangle\otimes\mathbbm{1}_{4}\bigr)\;, \qquad (2)

where \rho_{in}^{z} is a given input state, the initial states of the hidden and output layers have been chosen to be |0\rangle (w.l.o.g.), and the unitaries only address the subspaces described before.
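The feed-forward propagation of Eq. (2) can be checked numerically. The sketch below (helper names and the random-unitary sampling are our own illustration, not the paper's implementation) propagates a random pure two-qubit input through four random 8-dimensional unitaries with the stage-wise partial traces described above:

```python
import numpy as np

def random_unitary(d, rng):
    """Haar-distributed random unitary via QR of a complex Gaussian matrix."""
    a = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    q, r = np.linalg.qr(a)
    return q * (np.diag(r) / np.abs(np.diag(r)))  # fix column phases

def apply_on(U, rho, qubits, n):
    """Conjugate rho (on n qubits, qubit 0 most significant) by U on the listed qubits."""
    d = 2 ** n
    rest = [q for q in range(n) if q not in qubits]
    perm = list(qubits) + rest
    P = np.zeros((d, d))                      # permutation bringing `qubits` to the front
    for i in range(d):
        bits = [(i >> (n - 1 - q)) & 1 for q in range(n)]
        j = 0
        for q in perm:
            j = (j << 1) | bits[q]
        P[j, i] = 1.0
    V = P.T @ np.kron(U, np.eye(2 ** len(rest))) @ P
    return V @ rho @ V.conj().T

def ptrace_keep(rho, keep, n):
    """Partial trace over all qubits not listed in `keep`."""
    half = n
    t = rho.reshape([2] * (2 * n))
    for q in sorted(set(range(n)) - set(keep), reverse=True):
        t = np.trace(t, axis1=q, axis2=q + half)
        half -= 1
    d = 2 ** len(keep)
    return t.reshape(d, d)

rng = np.random.default_rng(7)
U1, U2, U3, U4 = (random_unitary(8, rng) for _ in range(4))

psi = rng.normal(size=4) + 1j * rng.normal(size=4)   # random pure two-qubit input
psi /= np.linalg.norm(psi)
rho_in = np.outer(psi, psi.conj())

ket00 = np.zeros((4, 4), dtype=complex); ket00[0, 0] = 1.0   # |00><00| ancilla

# input -> hidden: U1 on (in1, in2, h1), U2 on (in1, in2, h2), trace out the input
rho = np.kron(rho_in, ket00)
rho = apply_on(U1, rho, [0, 1, 2], 4)
rho = apply_on(U2, rho, [0, 1, 3], 4)
rho_hidden = ptrace_keep(rho, [2, 3], 4)

# hidden -> output: the same pattern with U3, U4, trace out the hidden layer
rho = np.kron(rho_hidden, ket00)
rho = apply_on(U3, rho, [0, 1, 2], 4)
rho = apply_on(U4, rho, [0, 1, 3], 4)
rho_out = ptrace_keep(rho, [2, 3], 4)

assert abs(np.trace(rho_out) - 1) < 1e-9      # the output is a valid two-qubit state
```

The final assertion confirms that the combination of unitaries and partial traces maps density matrices to density matrices, which is what allows the partial trace to play the role of a (dissipative) activation function.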

II.2 Cost Function Optimization - Backward Propagation

The cost function for our problem is then defined by

C(\{\Phi_{desired}^{z}\}_{z=1}^{N},\{\rho_{in}^{z}\}_{z=1}^{N},U_{1},U_{2},U_{3},U_{4})\;:=\;\frac{1}{N}\sum_{z=1}^{N}\langle\Phi_{desired}^{z}|\rho_{out}^{z}|\Phi_{desired}^{z}\rangle\;. \qquad (3)

In Ref. [5] a composite parametrization was introduced which allows us to compute the derivatives of the unitaries, needed for the optimization of the network, analytically. For any unitary operation U acting on a Hilbert space \mathcal{H}=\mathbb{C}^{d} with d\geq 2, spanned by the orthonormal basis \{|1\rangle,\ldots,|d\rangle\}, there exist d^{2} real values \lambda_{m,n} with m,n\in\{1,\ldots,d\}, where \lambda_{n,n}\in[0,2\pi], \lambda_{n,m}\in[0,2\pi] for m<n, and \lambda_{m,n}\in[0,\frac{\pi}{2}] for m<n, such that U\equiv U_{C} with

U_{C}=\left[\prod_{m=1}^{d-1}\left(\prod_{n=m+1}^{d}\underbrace{\exp\left(i\,\lambda_{n,m}\,P_{n}\right)\exp\left(i\,\lambda_{m,n}\,Y_{m,n}\right)}_{:=\Lambda_{m,n}}\right)\right]\cdot\left[\prod_{l=1}^{d}\exp(i\,\lambda_{l,l}\,P_{l})\right]\,. \qquad (4)

The ordering of the products is defined by \prod_{i=1}^{N}A_{i}=A_{1}\cdot A_{2}\cdots A_{N}. Here, the P_{l}=|l-1\rangle\langle l-1| are one-dimensional projectors and the Y_{m,n}=-i|m-1\rangle\langle n-1|+i|n-1\rangle\langle m-1| with 1\leq m<n\leq d are the generalized antisymmetric Pauli matrices.

The parameters \lambda_{m,n} can be gathered in a d\times d parameterization matrix

\textrm{``Parameterization matrix''}\;\equiv\;\left(\begin{array}{ccccc}\lambda_{1,1}&\lambda_{1,2}&\cdots&\lambda_{1,d-1}&\lambda_{1,d}\\ \lambda_{2,1}&\lambda_{2,2}&\cdots&\lambda_{2,d-1}&\lambda_{2,d}\\ \vdots&\vdots&\ddots&\vdots&\vdots\\ \lambda_{d-1,1}&\lambda_{d-1,2}&\cdots&\lambda_{d-1,d-1}&\lambda_{d-1,d}\\ \lambda_{d,1}&\lambda_{d,2}&\cdots&\lambda_{d,d-1}&\lambda_{d,d}\end{array}\right)\,, \qquad (10)

where the diagonal entries \lambda_{n,n} represent global phase transformations, the upper-right entries \lambda_{m,n} (m<n) are related to rotations in the subspaces spanned by |n\rangle and |m\rangle, while the lower-left entries \lambda_{n,m} are relative phases in these subspaces (with respect to the basis \{|0\rangle,\ldots,|d-1\rangle\}). Note that for the optimization one does not need to restrict the parameters \lambda_{n,m} to the intervals given above.
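The composite parametrization of Eq. (4) is straightforward to implement, since every factor acts non-trivially on at most a two-dimensional subspace. A minimal NumPy sketch (0-based indices stand in for the paper's 1-based ones; function names are our own):

```python
import numpy as np

def composite_unitary(lam):
    """Build U_C of Eq. (4) from a d x d real parameter matrix lam."""
    d = lam.shape[0]

    def exp_iP(theta, l):
        # exp(i theta P_l) is diagonal: a single phase on entry l
        D = np.ones(d, dtype=complex)
        D[l] = np.exp(1j * theta)
        return np.diag(D)

    def exp_iY(theta, m, n):
        # exp(i theta Y_{m,n}) rotates only the (m, n) subspace
        R = np.eye(d, dtype=complex)
        R[m, m] = R[n, n] = np.cos(theta)
        R[m, n] = np.sin(theta)
        R[n, m] = -np.sin(theta)
        return R

    U = np.eye(d, dtype=complex)
    for m in range(d - 1):                    # products ordered left to right
        for n in range(m + 1, d):
            # Lambda_{m,n} = exp(i lam[n,m] P_n) exp(i lam[m,n] Y_{m,n})
            U = U @ exp_iP(lam[n, m], n) @ exp_iY(lam[m, n], m, n)
    for l in range(d):                        # trailing diagonal phase factors
        U = U @ exp_iP(lam[l, l], l)
    return U

rng = np.random.default_rng(1)
lam = rng.uniform(0, 2 * np.pi, size=(4, 4))  # d = 4: 16 parameters, a two-qubit unitary
U = composite_unitary(lam)
assert np.allclose(U.conj().T @ U, np.eye(4), atol=1e-10)   # unitarity for any lam
```

Every parameter matrix yields a unitary by construction, which is why the gradient updates below can move freely in the d^2-dimensional parameter space without ever leaving the unitary group.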

Now we want to change the unitaries of the neural network in order to maximize the cost function C, and this for each parameter \lambda_{x,y} separately, i.e. we can consider a Taylor expansion:

U(\lambda_{x,y}+\varepsilon)\;\doteq\;U(\lambda_{x,y})+\varepsilon\,\frac{\partial U(\lambda_{x,y})}{\partial\lambda_{x,y}}+O(\varepsilon^{2})\;=\;U(\lambda_{x,y})+i\,\varepsilon\,U(\lambda_{x,y})\,\tilde{Y}_{x,y}+O(\varepsilon^{2}) \qquad (11)

with

\tilde{Y}_{x,y}\;=\;\begin{cases}U^{\dagger}_{x,y}\,Y_{x,y}\,U_{x,y}&\text{for}\quad x<y\\ P_{x}&\text{for}\quad x=y\\ U^{\dagger}_{x,y}\,P_{x}\,U_{x,y}&\text{for}\quad x>y\,,\end{cases} \qquad (15)

and

U_{x,y}=\begin{cases}\left[\prod_{n=y+1}^{d-1}\Lambda_{x,n}\right]\left[\prod_{m=x+1}^{d-2}\prod_{n=m+1}^{d-1}\Lambda_{m,n}\right]\prod_{l=x}^{d-1}e^{iP_{l}\lambda_{l,l}}&\text{for}\quad x<y\\ \left[\prod_{n=x}^{d-1}\Lambda_{y,n}\right]\left[\prod_{m=y+1}^{d-2}\prod_{n=m+1}^{d-1}\Lambda_{m,n}\right]\prod_{l=y}^{d-1}e^{iP_{l}\lambda_{l,l}}&\text{for}\quad x>y\end{cases} \qquad (18)

where we have used the results of Ref. [5]. Note that the \tilde{Y} are Hermitian; thus the unitarity condition holds in every order of the expansion. Moreover, note that for x\neq y the operator \tilde{Y} depends on all other \lambda parameters except \lambda_{x,y}.

Thus the parameters of the unitaries are updated by \lambda_{x,y}\longrightarrow\lambda_{x,y}+\delta\lambda_{x,y}, which is given for the first perceptron by the term

\delta\lambda^{(1)}_{x,y}\;\approx\;\varepsilon\frac{\partial C^{z}}{\partial\lambda^{(1)}_{x,y}}\;\approx\;\varepsilon\sum_{i,j=0}^{1}\langle ij\,\Phi_{desired}^{z}|\;U_{2}U_{1}\;\,i\,[\tilde{Y}^{(1)}_{x,y},\tilde{\rho}^{z}]\;\,U_{1}^{\dagger}U_{2}^{\dagger}\;|ij\,\Phi_{desired}^{z}\rangle\;, \qquad (19)

and for the second perceptron by the term

\delta\lambda^{(2)}_{x,y}\;\approx\;\varepsilon\frac{\partial C^{z}}{\partial\lambda^{(2)}_{x,y}}\;\approx\;\varepsilon\sum_{i,j=0}^{1}\langle ij\,\Phi_{desired}^{z}|\;U_{2}\;\,i\,[\tilde{Y}^{(2)}_{x,y},U_{1}\tilde{\rho}^{z}U_{1}^{\dagger}]\;\,U_{2}^{\dagger}\;|ij\,\Phi_{desired}^{z}\rangle\;, \qquad (20)

and so on. Here \varepsilon can be chosen arbitrarily, in principle differently for each unitary, and plays the role of the learning rate in a classical network: chosen too low, the cost function increases only slowly; chosen too high, we may miss the optimum. It is a hyperparameter of the learning process. In our applications we chose it to be the same for all four unitaries.
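The commutator form of the gradient and the role of the learning rate \varepsilon can be illustrated on a toy single-factor model U(\lambda)=\exp(i\lambda Y) on one qubit. This is a deliberately simplified sketch of the update rule, not the network's full four-unitary backward propagation; all names are our own:

```python
import numpy as np

SY = np.array([[0, -1j], [1j, 0]], dtype=complex)   # generator Y

def U_of(lam):
    # Y^2 = 1, so exp(i lam Y) = cos(lam) 1 + i sin(lam) Y
    return np.cos(lam) * np.eye(2) + 1j * np.sin(lam) * SY

def cost(lam, rho, phi):
    """Single-pair cost C = <phi| U rho U^dagger |phi>."""
    U = U_of(lam)
    return np.real(phi.conj() @ (U @ rho @ U.conj().T @ phi))

def grad(lam, rho, phi):
    """Analytic derivative <phi| U i[Y, rho] U^dagger |phi>, cf. Eq. (19)."""
    U = U_of(lam)
    comm = 1j * (SY @ rho - rho @ SY)
    return np.real(phi.conj() @ (U @ comm @ U.conj().T @ phi))

rho = np.diag([1.0, 0.0]).astype(complex)   # input |0><0|
phi = np.array([0.0, 1.0], dtype=complex)   # desired output |1>

# the commutator form agrees with a finite difference
h = 1e-6
fd = (cost(0.3 + h, rho, phi) - cost(0.3 - h, rho, phi)) / (2 * h)
assert abs(fd - grad(0.3, rho, phi)) < 1e-6

# gradient ascent lam -> lam + eps * dC/dlam with learning rate eps
eps, lam = 0.2, 0.3
for _ in range(100):
    lam += eps * grad(lam, rho, phi)
# here C = sin(lam)^2, so the ascent drives lam toward pi/2 where C = 1
assert cost(lam, rho, phi) > 0.999
```

The finite-difference check confirms that the derivative of the cost with respect to a unitary's parameter is exactly the expectation value of a commutator, which is the structural observation that connects the update rule to the Heisenberg uncertainty relation discussed next.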

Let us emphasize here that each of these equations has the form of a Heisenberg uncertainty relation, in the so-called Robertson version [8], i.e.

(\Delta\tilde{Y}_{x,y})_{\Psi}\,(\Delta\tilde{\rho})_{\Psi}\;\geq\;\frac{1}{2}\left|\,i\,\langle\Psi|\,[\tilde{Y}_{x,y},\tilde{\rho}]\,|\Psi\rangle\right| \qquad (21)

where \Delta denotes the standard deviation of the operator with respect to \Psi. Clearly, there are only two ways in which the lower bound of a Heisenberg uncertainty relation can vanish: either the two observables commute, or the expectation value of their commutator vanishes in the state \Psi. The first way is the general foundational limit provided by Nature: if two observables do not commute, for instance the famous position operator \hat{x} and momentum operator \hat{p} with [\hat{x},\hat{p}]=i\hbar\mathbbm{1}, then for all possible states \Psi the lower bound is \frac{\hbar}{2}. Differently stated, there exists no state for which the product of the standard deviations of position and momentum is smaller than this value.

On the other hand, if we consider e.g. the Pauli operators \sigma_{i}, the commutator is [\sigma_{i},\sigma_{j}]=2i\varepsilon_{ijk}\sigma_{k} and thus the lower bound reads

|\langle\Psi|\sigma_{k}|\Psi\rangle| \qquad (22)

which may be non-zero for a general \Psi, but for an appropriately chosen \Psi it can vanish even though the Pauli operators do not commute. This property of the Robertson version of the Heisenberg uncertainty relation was criticized, and an entropic version overcoming this issue was found, which we discuss in the conclusions. The question we want to address first is whether these fundamental limits are utilized in the optimization process of the quantum net.

III Examples and Results

Here we present two different examples of increasing complexity.

III.1 Example A: Learning A Single Unitary

Let us start with a simple example, namely learning a particular unitary V, first considered in Ref. [6]. In Ref. [13] this quantum neural network was applied to the real data of the Iris flower dataset and its performance was compared with other networks. The ground truth is given by choosing arbitrary input states \rho_{in}^{z} and computing the desired output as \rho_{desired}^{z}=V\rho_{in}^{z}V^{\dagger}. The goal is that the network learns this unitary (generally only 16 parameters) by optimizing the 4\cdot 64 parameters of the network via the cost function. One could fix \varepsilon, which may be interpreted as a learning rate, in each round, but we optimize \varepsilon by taking the maximum of the cost function over the computed corrections of all 4 unitaries. We chose 100 random pairs and used 10 for the optimization and 90 for the validation. The implementation was done in Wolfram Mathematica.

In Fig. 3 we plot the cost function and the validation function for different training rounds. At each training round the cost function is evaluated as a function of \varepsilon and its maximum is taken. As can be seen, the curves increase monotonically at each round, but the convergence is slow.

A typical update for the \lambda_{x,y} looks like (example for U_1, round 55)

\left(\begin{array}{cccccccc}
0.0183027&-0.0139361&0.000587036&-0.00177987&0.00189422&-0.00494903&0.00476928&0.00521779\\
-0.0110207&0&-0.00441354&0.00138528&0.00458157&0.000161181&-0.000201973&-0.0110672\\
-0.0000499063&-0.000143331&-0.00882844&0.00517684&-0.00501275&-0.0026375&0.00608451&0.00746686\\
-0.000666153&-0.000108606&-0.00185629&0&0.00584462&-0.00359862&0.0000593791&0.000123862\\
-0.00363472&0.00757916&-0.00248305&-0.00310511&-0.0106122&0.00191896&0.000186693&-0.000654047\\
0.00162855&-0.0133182&0.000649396&0.00231792&0.001892&0&-0.00185319&0.00141976\\
0.017559&0.0012003&-0.00531224&-0.000393857&-0.00213533&0.000525167&0.00113794&0.00242888\\
-0.0079773&-0.0082346&-0.00808296&-0.00738433&-0.00711041&-0.000057534&0.00043083&0\\
\end{array}\right) \qquad (31)

or (example for U_1, round 60)

\left(\begin{array}{cccccccc}
-0.00310993&-0.00480453&-0.000210951&-0.000935706&0.00180615&-0.00740156&0.00205214&-0.00979687\\
0.00169218&0&0.000343261&-0.000831133&0.000124329&-0.00183596&0.00103077&-0.00848936\\
-0.000158018&-0.000150325&0.000334237&-0.00343378&0.00233581&-0.00165355&0.00201545&0.00609294\\
-0.000131329&0.000143296&0.00110842&0&-0.00292121&0.00247819&0.00484749&-0.000320837\\
-0.00370207&0.00683878&0.000496848&0.00216435&0.00432188&0.000335835&0.00236745&-0.00040123\\
0.00308359&-0.0109918&0.00117493&0.000951308&-0.000982794&0&-0.00144647&0.00150626\\
-0.00397976&0.00264269&-0.00276574&0.00147698&0.00389812&0.0000669185&-0.00154618&0.00125958\\
0.00116858&0.000667512&0.000810526&-0.0028471&-0.00113776&-0.0000591576&0.00000439&0\\
\end{array}\right) \qquad (40)

which is the sum over all training states defining the lower bound in the Heisenberg relation and does not vanish. The zeros are due to the fact that we have chosen the hidden- and output-layer states to be |0\rangle. In Fig. 4 we show how the cost function typically changes with the learning parameter \varepsilon. It is quite constrained if all four unitaries are included, but for a single one the parameter space is quite flat. This suggests that the interplay of all four unitaries is relevant for the problem, whereas the constraint due to any single unitary alone is not sufficient.

Even though we are close to the maximal cost function value, the derivatives do not seem to vanish. To see whether this is due to the average over, in this case, 10 pairs, we picked one pair out and optimized it to a cost function value of 0.999003; the correction terms are still of the same order as above. This suggests that the neural network does not optimize the uncertainty relation but the parameters of the unitaries, which are not unique. Let us now choose a non-trivial problem.

Figure 3: The curves show the (orange-blue) cost function (10 pairs) and the (red-brown) validation function (90 pairs) in dependence of the training rounds for learning a single unitary. Actually, the cost function started with a value of 0.25; we show here only the optimization after the value 0.6 was reached.
Figure 4: The curves show the cost function C varied with the learning parameter \varepsilon for a typical step in the optimization process. If all corrections to the four unitaries are included, the maximum is quite pronounced and with it the choice of the optimal \varepsilon for the problem at each round. Varying only one unitary, the maximum is less pronounced, i.e. the problem is less constrained. Further note that the problem is not symmetric due to the composite parameterization of the unitaries.

III.2 Example B: A State Learning Its Own Quantum Properties?!

Now we create pairs such that the desired output state encodes the quantum properties of the input, i.e. its purity \mathsf{P}(\rho)=\frac{4}{3}(\mathrm{Tr}(\rho^{2})-\frac{1}{4}) and its entanglement. For the latter we choose the concurrence [7], a computable measure for bipartite qubit states. The concurrence \mathsf{Con}(\rho) is defined as the maximum of zero and the largest eigenvalue minus the sum of the other three eigenvalues of the quantity \sqrt{\sqrt{\rho}\,\sigma_{2}\otimes\sigma_{2}\,\rho^{*}\,\sigma_{2}\otimes\sigma_{2}\,\sqrt{\rho}}, with \sigma_{2} being the y-Pauli matrix. For pure states it simplifies to \mathsf{Con}(\psi)=|\langle\psi|\,\sigma_{y}\otimes\sigma_{y}\,|\psi^{*}\rangle|.

Our desired output states are chosen to be

|\Phi^{z}_{desired}\rangle\;=\;\sqrt{\frac{\mathsf{Con}(\psi_{in}^{z})}{2}}\,|00\rangle+\sqrt{1-\frac{\mathsf{Con}(\psi_{in}^{z})+\mathsf{P}(\psi_{in}^{z})}{2}}\,\frac{1}{\sqrt{2}}\bigl(|01\rangle+|10\rangle\bigr)+\sqrt{\frac{\mathsf{P}(\psi_{in}^{z})}{2}}\,|11\rangle\;. \qquad (41)

This means that each pair is again connected by a unitary (if we assume only pure input states), |\Phi^{z}_{\mathsf{desired}}\rangle=U_{\psi^{z}_{in}}|\psi^{z}_{in}\rangle, but it is chosen according to the quantum properties of the input state. Thus the net needs to learn a set of unitaries defined by the quantum properties (entanglement and purity) of the arbitrary input state. Consequently, the question is whether the neural net also processes the properties of the state itself, or whether only the information of the training pair is exploited, as would be the case in a classical neural network.
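The quantities entering Eq. (41) can be sketched as follows for pure two-qubit input states (function names are our own illustration); the final assertion confirms that the desired state is normalized for any input:

```python
import numpy as np

sy = np.array([[0, -1j], [1j, 0]])
SYSY = np.kron(sy, sy)                       # sigma_y tensor sigma_y

def purity(rho):
    """P(rho) = (4/3)(Tr(rho^2) - 1/4) for a two-qubit state."""
    return (4 / 3) * (np.real(np.trace(rho @ rho)) - 0.25)

def concurrence_pure(psi):
    """Con(psi) = |<psi| sigma_y x sigma_y |psi*>| for pure two-qubit states."""
    return abs(psi.conj() @ (SYSY @ psi.conj()))

def desired_state(psi):
    """The target of Eq. (41); the amplitudes encode Con and P of the input."""
    C = concurrence_pure(psi)
    P = purity(np.outer(psi, psi.conj()))
    mid = np.sqrt(max(1 - (C + P) / 2, 0.0)) / np.sqrt(2)
    return np.array([np.sqrt(C / 2), mid, mid, np.sqrt(P / 2)], dtype=complex)

# a Bell state is pure and maximally entangled: Con = P = 1
bell = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)
assert abs(concurrence_pure(bell) - 1) < 1e-12
phi_des = desired_state(bell)
assert abs(np.linalg.norm(phi_des) - 1) < 1e-12   # Eq. (41) is normalized
```

The normalization works out for every input because the squared amplitudes of Eq. (41) sum to \mathsf{Con}/2 + (1-(\mathsf{Con}+\mathsf{P})/2) + \mathsf{P}/2 = 1.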

We tried different sets for the training; here we discuss the result for 70 training pairs. The convergence is even slower than for the problem of a single unitary. We typically find a cost function value of 0.91 with a standard deviation of 0.08. For the validation with only 30 pairs the cost function value was found to be higher, i.e. 0.93 for our set. This shows a high statistical fluctuation with the randomly chosen set, meaning that the general problem is not (yet) fully learnt. As a further test we interchanged output with input, which gave a cost function of 0.30 with a standard deviation of 0.26. We also tried random inputs, which gave in general very low cost function values. Consequently, the net is indeed learning some features of the training set that also apply to an arbitrary set.

In Fig. 5 we visualize how well the quantum properties are learnt per se. The first graphs correspond to an early stage of the optimization process, where the cost function gave the value 0.84 with a standard deviation of 0.09. We see that in the optimization process the net learns e.g. the symmetry between the |01\rangle and |10\rangle states (Fig. 5(b)), but the range of the errors does not get smaller when compared to a later stage in the optimization. From Fig. 5(a) we can deduce that the error in purity is significantly reduced (having only pure states in the training), but the system also predicts values greater than 1, which is of course unphysical. This could be compensated by adding a Lagrange multiplier to the cost function. In general we observe that the training and validation pairs distribute quite similarly. The range in the error of the concurrence (Fig. 5(a)), however, is not reduced.

The correction terms obtained by backward propagation are always of a size similar to the trivial example discussed in the previous section; they are visualized in Fig. 6. The dependence on the learning parameter \varepsilon is depicted in Fig. 7. In conclusion, the net learns partial properties of the set, but the cost function converges slowly. In the next section we discuss whether the learning exploits the limits set by the Heisenberg uncertainty relation.


Figure 5: The pictures show the differences between the output of the net and the desired output at an early stage of the optimization and at the last round for a (noise-free) measurement of (a) 00 and 11 and (b) 01 and 10. Blue dots represent the 70 training pairs and green dots the 30 validation pairs.
Figure 6: This figure shows the 64 corrections of each \lambda_{x,y} forming one unitary for different optimization rounds. After some optimization steps we see oscillations, but no longer any differences in the average sizes of the corrections.

IV Conclusion & Discussion

In this contribution we have analysed a minimal deep quantum neural network, i.e. a net taking a two-qubit state as input and producing a two-qubit state as output, with one hidden layer of two qubits in between. For that we performed two case studies. In the first case, each training pair is connected by one particular randomly chosen unitary matrix. In the second example each pair is connected by a unitary that allows one to deduce from the output state the concurrence, a measure of entanglement, and the purity of the input state. Hence here we ask whether the net also learns those implicit properties of the input state, which is classically obviously impossible. In both cases the cost function was strictly increasing at each round of optimization, but typically not by a large amount. Consequently, some learning of the net has always been observed.

The unitaries involved in the net have been parameterized in a composite way, which allows a quantum information theoretic view into the working of the net. In particular it shows that the corrections to such parameterized unitaries have the form of a Heisenberg uncertainty relation, Eq. (21). One striking feature of the quantum nature of our world is that two non-commuting observables lead in general to a universal limit by Nature on the measurement outcomes. The most famous example is the uncertainty in momentum and position, (\Delta x)_{\psi}(\Delta p)_{\psi}\geq\frac{\hbar}{2}. The fact that the lower bound, the universal limit by Nature, is independent of the state is a special property of these two observables. In general one obtains the quantity \frac{1}{2}\left|\langle\psi|[\hat{A},\hat{B}]|\psi\rangle\right|. This is the Robertson form [8] of the Heisenberg uncertainty relation and was criticized because, by choosing an appropriate state \psi, it can vanish even if the two observables \hat{A},\hat{B} do not commute.

Furthermore, it was shown that there exists an information-theoretic formulation of the uncertainty principle [10] which does not suffer from this problem of state dependence. It puts a limit on the extent to which two observables can be simultaneously peaked. This entropic uncertainty relation for two non-degenerate observables (introduced by D. Deutsch [10], improved in Ref. [11], and proven in Ref. [12]) is given by

S(\hat{A})+S(\hat{B})\;\geq\;-\log_{d}\left(\max_{i,j}\left\{\left|\langle\chi_{a}^{i}|\chi_{b}^{j}\rangle\right|^{2}\right\}\right) \qquad (42)

where

S(\hat{A})=-\sum_{a}p_{a}\log_{d}p_{a} \qquad (43)

is the entropy for e.g. a certain prepared pure state \psi, and p_{a} is the probability associated with the measurement outcome a of \hat{A} for \psi, hence p_{a}=|\langle\chi_{a}|\psi\rangle|^{2}. Thus, in general, there is a universal limit for any two observables if they are non-commuting.
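For qubits, the entropic bound can be checked numerically. Below is a sketch (our own illustration) for the mutually unbiased eigenbases of \sigma_z and \sigma_x, where the right-hand side of Eq. (42) equals 1 in base-2 logarithms:

```python
import numpy as np

def entropy(probs, d):
    """Shannon entropy in base d: S = -sum_a p_a log_d p_a."""
    p = np.asarray(probs)
    p = p[p > 1e-15]                         # skip zero-probability outcomes
    return -np.sum(p * np.log(p) / np.log(d))

# eigenbases of sigma_z and sigma_x, which are mutually unbiased for d = 2
basis_z = np.eye(2, dtype=complex)
basis_x = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)

# the right-hand side of Eq. (42): -log_2 of the maximal squared overlap
overlap = max(abs(basis_z[:, i].conj() @ basis_x[:, j]) ** 2
              for i in range(2) for j in range(2))
bound = -np.log(overlap) / np.log(2)         # = 1 for mutually unbiased bases

# S(sigma_z) + S(sigma_x) >= 1 for every pure qubit state
rng = np.random.default_rng(5)
for _ in range(100):
    psi = rng.normal(size=2) + 1j * rng.normal(size=2)
    psi /= np.linalg.norm(psi)
    Sz = entropy([abs(basis_z[:, i].conj() @ psi) ** 2 for i in range(2)], 2)
    Sx = entropy([abs(basis_x[:, i].conj() @ psi) ** 2 for i in range(2)], 2)
    assert Sz + Sx >= bound - 1e-9
```

In contrast to the Robertson bound, this lower limit is state independent: it stays at 1 even for eigenstates of \sigma_z or \sigma_x, where one of the two entropies vanishes and the other is maximal.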

Coming back to our quantum neural network: we observed that those lower bounds never vanish, not even for a single generator \tilde{Y}. From that we can conclude that the net does not optimize the unitaries such that all or some commutators vanish. Consequently, we conjecture that the universal limit is not exploited in the optimization. Rather, the fact that the parameters oscillate shows the similarity to the optimization of classical neural networks. From this we infer that the optimization process does not exploit a particular quantum phenomenon.

In summary, these preliminary results have to be taken with care, since we only used a minimal version of a net (no deeper nets, only one example of a gradient-descent-based optimization, and a limited set of problems). Moreover, there are several more techniques that could be applied to optimize the learning process. Utilizing a gradient-descent-based optimization, our findings are also strongly correlated with other works, e.g. Refs. [14, 15, 16], discussing e.g. barren-plateau landscapes and how to avoid them. Further detailed studies are necessary to confirm these findings. However, for the minimal setting discussed here, Heisenberg's uncertainty relation is not a guiding principle.


Figure 7: The curves show the cost function C varied with the learning parameter \varepsilon for a typical step in the optimization for (a) the training set (30 states) and (b) the validation set (70 states). If all corrections to the four unitaries are included, the maximum is quite pronounced and with it the choice of the optimal \varepsilon for the problem. Varying only one unitary, the maximum is less pronounced, i.e. the problem is less constrained, except for one of the four unitaries. In contrast to the training set (a), the validation set (b) does not reach the maximum in the parameter space and gives a considerably lower value of the cost function, as expected. However, the dependence on the learning parameter \varepsilon is quite similar to the training set.
Acknowledgements.
BCH thanks the organizers of the workshop “International Workshop on Machine Learning and Quantum Computing, Applications in Medicine and Physics (WMLQ2022)” for putting together an inspiring, state-of-the-art programme and a vivid environment for discussions. BCH also gratefully acknowledges that this research was funded in whole, or in part, by the Austrian Science Fund (FWF), project P36102.

References

  • [1] H. Behrends, D. Millinger, W. Weihs-Sedivy, A. Javornik, G. Roolfs and St. Geißendörfer. Analysis Of Residual Current Flows In Inverter Based Energy Systems Using Machine Learning Approaches, Energies 15, 582 (2022).
  • [2] LHCb collaboration, A new algorithm for identifying the flavour of B^{0}_{s} mesons at LHCb, Journal of Instrumentation 11, 05010 (2016).
  • [3] G. Angloher et al., Towards an automated data cleaning with deep learning in CRESST, https://doi.org/10.48550/arXiv.2211.00564.
  • [4] F. Leymann and J. Barzen, The Bitter Truth About Quantum Algorithms in the NISQ Era, Quantum Sci. Technol. 5, 044007 (2020).
  • [5] Ch. Spengler, M. Huber and B.C. Hiesmayr, Composite parameterization and Haar measure for all unitary and special unitary groups, J. Math. Phys. 53, 013501 (2012).
  • [6] K. Beer, D. Bondarenko, T. Farrelly, T. J. Osborne, R. Salzmann, D. Scheiermann, and R. Wolf, Training deep quantum neural networks, Nature Communications 11, 808 (2020).
  • [7] S. Hill and W. K. Wootters, Entanglement of a Pair of Quantum Bits, Phys. Rev. Lett. 78, 5022 (1997).
  • [8] H.P. Robertson, The Uncertainty Principle, Phys. Rev. 34, 163 (1929).
  • [9] I. Bialynicki-Birula and L. Rudnicki, Entropic Uncertainty Relations in Quantum Physics, in Statistical Complexity, Ed. K. D. Sen, Springer, 2011, Ch. 1, https://arxiv.org/abs/1001.4668.
  • [10] D. Deutsch, Uncertainty in quantum measurements, Phys. Rev. Lett. 50, 631 (1983).
  • [11] K. Kraus, Complementary observables and uncertainty relations, Phys. Rev. D 35, 3070 (1987).
  • [12] H. Maassen and J.B.M. Uffink, Generalized Entropy Uncertainty Relation, Phys. Rev. Lett. 60, 1103 (1988).
  • [13] S. Wilkinson and M. Hartmann, Evaluating the performance of sigmoid quantum perceptrons in quantum neural networks, quant-ph/2208.06198v1, https://doi.org/10.48550/arXiv.2208.06198 (2022).
  • [14] A. Kulshrestha and I. Safro, Avoiding Barren Plateaus in Variational Quantum Algorithms, arXiv.2204.13751, https://doi.org/10.48550/arXiv.2204.13751
  • [15] J.R. McClean, S. Boixo, V. N. Smelyanskiy, R. Babbush and H. Neven, Barren plateaus in quantum neural network training landscapes, Nature Communications 9, 4812 (2018).
  • [16] D. Heimann, G. Schönhoff, F. Kirchner, Learning capability of parametrized quantum circuits, arXiv:2209.10345, https://doi.org/10.48550/arXiv.2209.10345 (2022).