Generalized quantum similarity learning

Santosh Kumar Radha [email protected] Casey Jao [email protected] Agnostiq Inc. 325 Front St W, Toronto, ON M5V 2Y1

Abstract

The similarity between objects is significant in a broad range of areas. While it can be measured using off-the-shelf distance functions, they may fail to capture the inherent meaning of similarity, which tends to depend on the underlying data and task. Moreover, conventional distance functions limit the space of similarity measures to be symmetric and do not directly allow comparing objects from different spaces. We propose using quantum networks (GQSim) for learning task-dependent (a)symmetric similarity between data that need not have the same dimensionality. We analyze the properties of such similarity function analytically (for a simple case) and numerically (for a complex case) and show that these similarity measures can extract salient features of the data. We also demonstrate that the similarity measure derived using this technique is $(\epsilon,\gamma,\tau)$ -good, resulting in theoretically guaranteed performance. Finally, we conclude by applying this technique for three relevant applications - Classification, Graph Completion, Generative modeling.

I Introduction

Notions of (dis)similarity are fundamental to learning. For example, they are implicit in probabilistic models that use dissimilarity computations to derive model parameters. In contrast, $k-$ nearest neighbor (KNN) or Support Vector Machines (SVM) methods explicitly find training instances similar to the input. Notions of similarity also play a fundamental role in human learning. It is well known that people can perceive different degrees of similarity. The semantics of such judgments may depend on both the task at hand and the particulars of data. Since manually tuning such similarity functions can be difficult and tedious for real-world problems, the notion of automatically learning task/data-specific similarity ( $\mathcal{S}$ ) from labeled data have been introduced [1, 2, 3, 4, 5]. In general, these learning methods are based on the intuition that a good similarity function should assign a large/small score to a pair of points in the same/different classes.

The most common way to model similarity is to use a distance metric $d$ as a model for the desired similarity measure. By definition, $d$ obeys the properties $d(x,x)=0$ , $d(x_{1},x_{2})=d(x_{2},x_{1})\geq 0$ , $d(x_{1},x_{2})+d(x_{2},x_{3})\geq d(x_{1},x_{3})$ , where $x\in\mathcal{X}$ with $\mathcal{X}$ the data space. An important effect of this is the transitive nature of the similarity function i.e. $\mathcal{S}$ derived from $d$ obeys the following, if $\mathcal{S}(x_{1},x_{2})=1$ and $\mathcal{S}(x_{2},x_{3})=1$ , then $\mathcal{S}(x_{1},x_{3})=1$ . The class of similarity maps attainable by these methods is just a subset of general similarity maps.

Another major technique used in classical machine learning is manifold-learning. This method aims to learn a low-dimensional structure in data under the assumption that the data lie on a (possibly non-linear) manifold. Examples of such methods include[6] local linear embedding[7], multidimensional scaling[8], Laplacian Eigenmaps[9] etc. These methods, by nature, produce only symmetric similarity measures i.e. $\mathcal{S}(x_{1},x_{2})=\mathcal{S}(x_{2},x_{1})$ .

Refer to caption — Figure 1: Learning relations between data in different spaces

Meanwhile, much work has been done to marry quantum computing and classical machine learning techniques to construct practical quantum algorithms compatible with the Noisy Intermediate-Scale Quantum (NISQ) era. One approach is to use parametrized quantum circuits (PQC) to define a hypothesized class of functions that potentially might offer access to a class of models beyond what is possible with classical models. Such quantum learning/variational algorithm have been used for various applications[10, 11, 12, 13, radha2021quantumwick]. Theoretical works have been successful in showing such quantum models exist for specific synthetic problems[14, 15, 16]. Specific classes of such quantum machine learning (QML) algorithms can be mathematically connected to their classical counterpart - kernel methods[17, 18]. In these methods, the learning is based on the kernels that are calculated between two given data points after mapping them to a latent space. This kernel can be made of PQCs, which, when trained, define distances on the latent space[19].

In this paper, we show that a general framework of such quantum embedding maps expands the class of algorithms from distance metric learning to (a)symmetric-similarity learning ones, which we call GQSim. This general class of QML algorithms has striking similarities with the classical similarity learning methods, including Siamese networks[20]. We show that these extensions allow for more general properties like intransitivity, asymmetry, etc., enabling us to learn a larger class of similarity maps than just distance-based similarity. Finally, one advantage of this technique is the ability to compare objects belonging to possibly different spaces, as shown schematically in Figure 1. Formally, given $\tilde{x}\in\mathbb{R}^{\tilde{n}}$ and $x\in\mathbb{R}^{n}$ data instances from two different sources/base space, and set of triplets $(\tilde{x}_{i},x_{i},y_{i})$ where $y_{i}:=\mathcal{S}(\tilde{x}_{i},x_{i})$ , the task is to predict $y_{j}$ for a pair of unseen instances $(\tilde{x}_{j},x_{j})$ with certain non trivial properties for $\mathcal{S}$ (like asymmetry). Note that even though we have $y_{i}\in\mathbb{R}_{+}$ , the same technique can easily be extended to non-regression type problems by having $y_{i}\in\{0,1\}$ . Such learning applications abound in nature; for instance, learning the sentence similarity between two different languages would generally be asymmetric and involve two different spaces.

We start by formulating the problem at hand and its underlying theory in section II. In section III, we first discuss analytically various details of the GQSim methods, including the effect of partial measurement. We illustrate this by solving a toy-problem analytically. Second, in subsection III.2 we pick a numerical problem of learning similarity between two different subspaces made up of synthetic images and points in $\mathbb{R}^{2}$ to demonstrate the generalizability of the learned model. section IV discusses similarity learning in various applied settings, and finally, we summarize our findings in section V.

II Formalism

For sets $X$ and $\tilde{X}$ , assume we are given set of elements $A\subset X\times\tilde{X}$ and the corresponding set of similarity measure $Y$ where $Y:=\mathcal{S}(x,\tilde{x})$ for all $(x,\tilde{x})\in A$ with $Y\in\mathbb{R}_{+}$ . Our goal is to learn the model $\mathcal{S}$ that is used to predict the similarity between unseen elements in $X\times\tilde{X}$ . We first start by generalizing metric learning for multi-subspace setting after which we introduce the framework for learning asymmetric similarity. Even though $\mathcal{S}\in\mathbb{R}_{+}$ , for the sake of discussion, we will restrict ourselves to $\mathcal{S}$ being binary values of either similar or dissimilar. Learning of $\mathcal{S}$ is done by using parameterized quantum circuits, details of which are discussed in subsubsection III.2.1.

II.1 Multi-subspace metric learning

Given $\mathcal{E}_{l},\mathcal{D}_{k}\subset X\times\tilde{X}$ with cardinality $l,k$ such that pairs $(x_{i},\tilde{x}_{i})\in\mathcal{E}_{l}$ are deemed related ( $y_{i}=1$ ) and $(x_{i},\tilde{x}_{i})\in\mathcal{D}_{k}$ are unrelated ( $y_{i}=0$ ). Then we try to learn a distance function

d:X\times\tilde{X}\to\mathbb{R}_{+}

such that

\displaystyle\begin{cases}d(x,\tilde{x})\leq\delta&(x,\tilde{x})\in\mathcal{E}_{l}\\ d(x,\tilde{x})\geq\varepsilon^{-1}&(x,\tilde{x})\in\mathcal{D}_{k}\end{cases}

(1)

for some small constants $\delta,\varepsilon>0$ . Here, $d$ need not satisfy the usual axioms of a distance metric, such as symmetry, which may not even make sense as the inputs come from different spaces. The problem can also be phrased in terms of a similarity measure $\mathcal{S}$ , which simply reverses the inequalities: $\mathcal{S}(x,\tilde{x})$ should be large when $x$ and $\tilde{x}$ are similar and small when they are dissimilar.

We consider $d$ of the form

\displaystyle d(x,\tilde{x}):=\|f_{\theta}(x)-g_{\eta}(\tilde{x})\|_{\mathcal{H}}.

(2)

where the feature maps $f_{\theta}:X\to\mathcal{H}$ , $g_{\eta}:\tilde{X}\to\mathcal{H}$ encode the raw input data in a common Hilbert space $\mathcal{H}$ . Concretely, let $H=(\mathbb{C}^{2})^{\otimes n}$ be the Hilbert space of a quantum computer with $n$ qubits, and let $\mathcal{H}=\mathcal{L}(H)$ be the space of complex-linear operators on $H$ equipped with the Hilbert-Schmidt inner product. Following [21], define the feature maps

\displaystyle\begin{split}f_{\theta}(x)&=U_{\theta}(x)\outerproduct{0^{n}}{0^{n}}U_{\theta}(x)^{\dagger}\\ g_{\eta}(\tilde{x})&=V_{\eta}(\tilde{x})\outerproduct{0^{n}}{0^{n}}V_{\beta}(\tilde{x})^{\dagger}\end{split}

(3)

where $U_{\theta}(x)$ , $V_{\eta}(\tilde{x})$ are parameterized quantum circuits. Then

\displaystyle d(x,\tilde{x})^{2}=2-\mathcal{S}_{\theta,\eta}(x,\tilde{x})

(4)

where the similarity measure

\displaystyle\begin{split}\mathcal{S}_{\theta,\eta}(x,\tilde{x})&=\Tr{f_{\theta}(x)^{\dagger}g_{\eta}(\tilde{x})}\\ &=\absolutevalue{\expectationvalue{U_{\theta}(x)^{\dagger}V_{\eta}(\tilde{x})}{0^{\otimes n}}_{H}}^{2}\end{split}

(5)

is nothing but the overlap between states $U_{\theta}(x)\ket{0^{n}}$ and $V_{\eta}(\tilde{x})\ket{0^{n}}$ . There are multiple methods to efficiently compute Equation 5 using a quantum device. It can be measured experimentally by first preparing the state $U_{\theta}(x)^{\dagger}V_{\eta}(\tilde{x})\ket{0^{n}}$ and then computing the probability that all $n$ qubits measure to $0$ as shown in Figure 2(b). Alternatively, one can prepare the states $U_{\theta}(x)\ket{0^{n}}$ , $V_{\eta}(\tilde{x})\ket{0^{n}}$ in two separate $n$ -qubit registers and performing a SWAP test with an ancilla qubit. The latter scheme trades circuit depth for width.

II.2 Generalized similarity learning

In the previous scenario, we formulated the picture where data in each individual space is mapped by a unique embedding to a Hilbert space. The quantum feature maps are hitherto defined by applying unitaries to a standard $n$ -qubit state $\outerproduct{0^{n}}{0^{n}}$ . We will explore two variations of Equation 5, which by slightly tweaking, generalizes the previous setting.

Consider first a simple variant of the SWAP test method. Recall that in this setup, the feature maps $f_{\theta}(\cdot)$ , $g_{\eta}(\cdot)$ act on separate registers of $n$ qubits each. For $m\leq n$ , perform the SWAP test only on the first $m$ qubits of each register to obtain

\displaystyle\mathcal{S}_{\theta,\eta}^{m}(x,\tilde{x}):=\Tr[f_{\theta,m}(x)^{\dagger}g_{\eta,m}(\tilde{x})\bigr{]},

(6)

where the (mixed) states

	$\displaystyle f_{\theta,m}(x)=\Tr_{n-m}\bigl{[}{f_{\theta}(x)}\bigr{]}$		(7)
	$\displaystyle\ g_{\eta,m}(\tilde{x})=\Tr_{n-m}\bigl{[}g_{\eta}(\tilde{x})\bigr{]}$		(8)

are the partial trace of the original density matrices over the last $n-m$ qubits, such a circuit is shown in Figure 2(d). When $m<n$ , $f_{\theta,m}$ encodes classical data into an $m$ -qubit system by applying a possibly non-unitary CPTP map to the initial state $\outerproduct{0^{m}}{0^{m}}$ .

The second tweak comes by slightly changing modifying how Equation 5 is measured without SWAP test. An alternative for calculating expectation value, as mentioned previously, is to apply $U_{\theta}(x)^{\dagger}V_{\eta}(\tilde{x})$ to $\ket{0^{n}}$ and measure the probability that all $n$ qubits return $0$ . Instead, we will now only inspect the first $m$ qubits to obtain another modified similarity

	$\displaystyle\tilde{\mathcal{S}}_{\theta,\eta}^{m}(x,\tilde{x})$	$\displaystyle=\sum_{i\in\{0,1\}^{n-m}}\absolutevalue{\innerproduct{0^{m}i}{\phi_{x,\tilde{x}}^{\theta,\eta}}}^{2}$		(9)
		$\displaystyle=\Tr{\left\{\outerproduct{0^{m}}{0^{m}}\left(\Tr_{n-m}\left[\ket{\phi^{\theta,\eta}_{x,\tilde{x}}}\bra{\phi^{\theta,\eta}_{x,\tilde{x}}}\right]\right)\right\}}$		(10)

where $\ket{\phi^{\theta,\eta}_{x,\tilde{x}}}=U_{\theta}(x)^{\dagger}V_{\eta}(\tilde{x})\ket{0^{n}}$ . This circuit is shown in Figure 2(c).

One way to think of this quantity is to regard the correspondence

(x,\tilde{x})\mapsto U_{\theta}(x)^{\dagger}V_{\eta}(\tilde{x})\ket{0^{n}}

as an embedding of pairs $(x,\tilde{x})$ . One might more suggestively write $U_{\theta}(x)^{\dagger}V_{\eta}(\tilde{x})=\tilde{U}_{\alpha}(x,\tilde{x})$ for some effective $\tilde{U}$ as shown in Figure 3. Ideally, all similar pairs $(x,\tilde{x})\in\mathcal{E}_{l}$ should align perfectly with $\ket{0^{n}}$ and dissimilar pairs $(x,\tilde{x})\in\mathcal{D}_{k}$ should be orthogonal to $\ket{0^{n}}$ . Here, we do not lose generality by mapping it to $\ket{0^{n}}$ as any other arbitrary state can be mapped back to $\ket{0^{n}}$ by a unitary which then can be absorbed into the embedding parameterized unitary. Measuring only some of the qubits effectively takes a partial trace over the remaining ones. Note that unlike the previous formulation, $\tilde{\mathcal{S}}^{m}_{\theta,\eta}$ is need not be symmetric in $x$ and $\tilde{x}$ , even when $X=\tilde{X}$ .

II.3 Goodness of similarity learning

It is crucial to understand what it means for a pairwise functional maps $\mathcal{S}$ to be a “good similarity function” for a given learning problem.

Balcan et al. [22] proposed a new learning theory for similarity functions. This framework aims to generalize the learning techniques of kernels by relaxing some of the constraints we are currently interested in. They give intuitive and sufficient conditions for a similarity function to guarantee performance. Essentially, a similarity function ( $\mathcal{S}$ ) is $(\epsilon,\gamma,\tau)$ -good if a $1-\epsilon$ proportion of examples are on average $2\gamma$ more similar to examples of the same class than to examples of the opposite class for a $\tau$ proportion of examples. This is explicitly proved for a general $\mathcal{S}$ which need not be a metric that satisfies positive semi-definiteness or symmetry. Given a $(\epsilon,\gamma,\tau)$ -good $\mathcal{S}$ , it can be shown that $\mathcal{S}$ can be used to build a linear separator in an explicit projection space that has a margin $\gamma$ and error arbitrarily close to $\epsilon$ .

Definition II.1 (Balcan et al. [22])

A similarity function ( $\mathcal{S}$ ) is $(\epsilon,\gamma,\tau)$ -good function for a learning problem $P$ if there exists a (random) indicator function $\mathbf{R}(\mathbf{x})$ defining a (probabilistic) set of points such that the following conditions hold:

A $1-\epsilon$ probability mass of examples $(x,y(\mathbf{x}))$ satisfy

\displaystyle\mathbb{E}_{\left(\mathbf{x}^{\prime},y(\mathbf{x}^{\prime})\right)\sim P}\left[y(\mathbf{x})y(\mathbf{x}^{\prime})\mathcal{S}\left(\mathbf{x},\mathbf{x}^{\prime}\right)\right]\geq\gamma

2.

$\operatorname{Pr}_{\mathbf{x}^{\prime}}\left(R\left(\mathbf{x}^{\prime}\right)\right)\geq\tau$ .

Formally defined in Definition II.1, it can also be shown that the set of similarity maps defined by this is strictly larger than the set of metric maps and also includes non-positive semi-definite kernels and asymmetric maps. With the above definition, Ref[22] proved the following theorem

Theorem II.1 (Balcan et al. [22])

Let $\mathcal{S}$ be a $(\epsilon,\gamma,\tau)$ -good similarity function for a learning problem for a learning problem $P$ . Let $\mathcal{L}=\left\{\mathbf{x}_{1}^{\prime},\ldots,\mathbf{x}_{d}^{\prime}\right\}$ be a (potentially unlabeled) sample of $d=\frac{2}{\tau}\left(\log(2/\delta)+8\frac{\log(2/\delta)}{\gamma^{2}}\right)$ landmarks drawn from $P$ . Consider the mapping $\phi^{\mathcal{L}}:\mathcal{X}\rightarrow\mathbb{R}^{d}$ defined as following

\displaystyle\phi^{\mathcal{L}}(\mathbf{x})=\mathcal{S}\left(\mathbf{x},\mathbf{x}_{i}^{\prime}\right),i\in\{1,\ldots,d\}.

Then, with probability $1-\delta$ over the random sample $\mathcal{L}$ the induced distribution $\phi^{\mathcal{L}}(P)$ in $\mathbb{R}^{d}$ has a separator error at most $\epsilon+\delta$ relative to $L_{1}$ margin at least $\gamma/2$ .

Theorem II.1 states that, if enough data is available for the problem $P$ , then for a $(\epsilon,\gamma,\tau)$ -good similarity function, with high probability there exists a low-error (arbitrarily close to $\epsilon$ ) linear separator in mapped $\phi$ -space. This framework of $(\epsilon,\gamma,\tau)$ -good functions opens the door for us to evaluate the performance that one can expect a global linear separator to have, depending on how well a similarity function satisfies Definition II.1. We will see in subsection III.2 that this method, at least numerically, can separate the classes as needed by definition.

III Discussion

We will start with a qualitative discussion on similarity learning methods introduced in subsection II.1, following which we solve a toy embedding explicitly to understand the intricacies of these methods. Next, we explore a more complex example of learning the similarity between a synthetic image set and abstract 2D space, and numerically show that this learning method can learn salient encoded features of the images.

III.1 Analytical Discussion

By taking $X=\tilde{X}$ in metric learning setting, we reduce to the framework of classical metric learning where the sought-after $d$ is assumed to be a distance (pseudo-)metric in the usual sense. Thus, $d$ should satisfy the following properties $d(x_{1},x_{2})\geq 0$ with equality when $x_{1}=x_{2}$ ; $d(x_{1},x_{2})=d(x_{2},x_{1})$ , and $d(x_{1},x_{3})\leq d(x_{1},x_{2})+d(x_{2},x_{3})$ for all triples $x_{1},x_{2},x_{3}\in X$ . One way to enforce these constraints is to seek a distance function of the form

d_{\theta}(x_{1},x_{2})=\|f_{\theta}(x_{1})-f_{\theta}(x_{2})\|_{Z}

where $f_{\theta}$ is some neural network mapping $X$ into some latent Euclidean space $Z$ . This construction, depicted in Figure 4, is called a Siamese neural network [20]. By taking $f_{\theta}(x)=\outerproduct{\phi_{\theta}(x)}{\phi_{\theta}(x)}$ as a quantum feature map as in the previous subsection, we have

\displaystyle d_{\theta}(x_{1},x_{2})^{2}=2-2K_{\theta}(x_{1},x_{2})

(11)

where $K_{\theta}(x_{1},x_{2})=\absolutevalue{\innerproduct{\phi_{\theta}(x_{1})}{\phi_{\theta}(x_{2})}}^{2}$ is a parameterized quantum kernel[23, 24]. As noted in Ref.[24, Appendix A], the training criterion Equation 1 implicitly guides the latent space embeddings $f_{\theta}$ to separate dissimilar pairs while compressing similar pairs together.

Next, in case of the similarity $\mathcal{S}^{m}$ , as hinted in subsection II.2, when $m<n$ , $f_{\theta,m}$ is a non-unitary CPTP map. This thus enriches the family of feature maps at our disposal compared to the previous case. Intuitively this can be understood as following, we prepare two quantum states that map classical data $x$ and $\tilde{x}$ to their respective states $\ket{x},\ket{\tilde{x}}$ . In the previous multi-space metric learning case, we required that (dis)similar elements be mapped (far)close to each other in the Hilbert space $\mathcal{H}^{n}$ with dimension $n$ . In contrast, similarity defined in Equation 6 requires that dis)similar gets mapped (far)close to each other in the Hilbert space $\mathcal{H}^{m}\subset\mathcal{H}^{n}$ . This is similar to the projective SWAP test used in Ref[25].

We will now offer some heuristics for the more general similarity measure $\mathcal{\tilde{S}}^{m}$ . It is easier to cluster points together in lower-dimensional spaces, which may help when similar raw data are far apart in their ”native” spaces $X,\tilde{X}$ . The trade-off is that separating points becomes hard. The number of qubits measured changes the relative difficulties of satisfying the training constraints.

Precisely, suppose one measures $m$ of the $n$ qubits. Then if $(x,\tilde{x})\in\mathcal{E}_{l}$ , the possible embeddings $\ket{\phi^{\theta,\eta}_{x,\tilde{x}}}$ that satisfy the constraint $\mathcal{S}^{m}(x,\tilde{x})=1$ live in a subspace of dimension $d_{\mathcal{E}_{l}}=2^{n-m}$ . This space is spanned by all basis states $\{\ket{0i},\ i\in\{0,1\}^{n-m}\}$ . On the other hand, if $(x,\tilde{x})\in\mathcal{D}_{k}$ , the states $\ket{\phi^{\theta,\eta}_{x,\tilde{x}}}$ that satisfy $\mathcal{S}^{m}(x,\tilde{x})=0$ constitute a subspace of dimension $d_{\mathcal{D}_{k}}=2^{n}-2^{n-m}$ . The ratio $\zeta=\frac{d_{\mathcal{E}_{l}}}{d_{\mathcal{D}_{k}}}=\frac{1}{2^{m}-1}$ describes the relative difficulties of the training constraints. In the scenario where $m=1$ , we have $\zeta=1$ , i.e we provide equal volumetric space in the model to place similar and dissimilar data points. In contrast, when $m=n$ , $\zeta=\frac{1}{2^{n}-1}$ where we provide larger space for dissimilar points to live. This gives us the ability to tune the model space we are working with based on the data provided, for instance for an highly imbalanced data set (when $\absolutevalue{l-k}>>0$ ), one can choose $m$ based on if $\frac{l}{k}\approx\zeta$ .

We illustrate these considerations with a toy example. Consider the following simple two-qubit circuit – $U(x,\tilde{x})=R_{y}(x,1)CNOT(0,1)R_{x}(\tilde{x},1)$ where $R_{x}(a,b)$ and $R_{y}(a,b)$ are the Pauli $X$ and $Y$ rotations by angle $a$ on qubit $b$ – and the state $\ket{\phi^{\theta,\eta}_{x,\tilde{x}}}=\ket{\phi_{x,\tilde{x}}}=U(x,\tilde{x})\ket{0^{\otimes 2}}$ . $U$ can be interpreted as a embedding of $x,\tilde{x}$ in a Hilbert space for a particular choice of parameters. By traversing $0<x,\tilde{x}\leq 2\pi$ , we can trivially look at the entire embedding space accessible by the constant ansatz $U$ . We can thus look at the number of pairs of similar and dissimilar points that are accommodated by this ansatz based on the measure $\mathcal{S}^{m}$ for various $m$ (in this case $m\in\{1,2\}$ ), both of which is shown in Figure 5(a) as circuits. As shown in appendix A one gets

	$\displaystyle\mathcal{S}^{2}(x,\tilde{x})$	$\displaystyle=\cos^{2}{\left(\frac{x}{2}\right)}\cos^{2}{\left(\frac{\tilde{x}}{2}\right)},$		(12)
	$\displaystyle\mathcal{S}^{1}(x,\tilde{x})$	$\displaystyle=\frac{\cos{\left(x-\tilde{x}\right)}+\cos{\left(x+\tilde{x}\right)}}{4}+\frac{1}{2}.$		(13)

In Figure 5(a), we plot $\mathcal{S}^{1}$ and $\mathcal{S}^{2}$ for all values of $x,\tilde{x}$ . Intuitively, $\mathcal{S}$ ’s closeness to a value of ( $0$ ) $1$ indicates the paired points $x,\tilde{x}$ are (dis)similar to each other. We see that in the case of $\mathcal{S}^{2}$ we have exactly 4 points in the space that are maximally similar to each other. In contrast, a huge set of points with an almost flat region have a similarity measure of $0$ . Comparatively, in $\mathcal{S}^{1}$ , we have lifted a majority of this flat region to support more points that are similar to each other. To better understand this, we also look at the Density of States (DoS) of $\mathcal{S}$ for both the cases. The following defines this,

\displaystyle D(\mathcal{S}):=\int\frac{\mathrm{d}x\mathrm{d}\tilde{x}}{(2\pi)^{2}}\cdot\delta(\mathcal{S}-\mathcal{S}(x,\tilde{x})).

(14)

$D(\mathcal{S})$ essentially counts the number of points inside each fundamental unit of similarity. Figure 5(b) shows the DOS, where we see a peak close to $0$ , for $\mathcal{S}^{2}$ indicating that this measure allows for a high imbalance between similar and dissimilar points. In the case of $\mathcal{S}^{1}$ , as discussed previously, we have a perfectly even split of volume between similar and dissimilar points with mid (dis)similarity measure being $0.5$ .

Let us now place the above problem in the setting of retrieval problem to gauge the ability of the respective family of similarity measures. Given two subspaces $X,\tilde{X}\in(0,2\pi]$ , and $x_{s},x_{d}\in X$ , we strive to find some point $\tilde{X}$ that is most similar to $x_{s}$ but most dissimilar to $x_{d}$ . This problem can be formulated as minimizing

\displaystyle\min_{\tilde{x}}\mathcal{L}_{\mathcal{S}}(\tilde{x},x_{s},x_{d})

(15)

where

\displaystyle\mathcal{L}_{\mathcal{S}}(\tilde{x},x_{s},x_{d})=\frac{1}{2}\left[(1-\mathcal{S}(x_{s},\tilde{x}))^{2}+\mathcal{S}(x_{d},\tilde{x})^{2}\right]^{\frac{1}{2}},

(16)

for some similarity measure $\mathcal{S}$ and loss function $\mathcal{L}$ . $\mathcal{L}(\tilde{x},\cdot,\cdot)$ measures the loss of how well the similarity measure has performed. As a first illustration, we pick $x_{s}=0.3$ and $x_{d}=0.5$ and then calculate $\mathcal{L}(\tilde{x},x_{s}=0.3,x_{d}=0.5)$ using the same embedding $U$ . This is shown in Figure 6(a), where the dotted lines use the measure $\mathcal{S}^{m=2}$ while the red line uses $\mathcal{S}^{m=1}$ . Since the quantum embedding map is the same for both the cases, they both have the same optimal $\tilde{x}\approx\frac{\pi}{2}$ . We see that the loss function, which can be used as a surrogate to quantify how good the similarity measure is in separating the optimal $\tilde{x}$ , is much lower/deeper in the case of $m=1$ . Thus partial measurement of $m=1$ is able to attain a better separation than $m=2$ . To complete the analysis, we calculate the quantity

\displaystyle\lambda(x_{s},x_{d})=\mathcal{L}_{\mathcal{S}^{1}}(\tilde{x}_{x_{s},x_{d}}^{*},x_{s},x_{d})-\mathcal{L}_{\mathcal{S}^{2}}(\tilde{x}_{x_{s},x_{d}}^{*},x_{s},x_{d}),

(17)

where $\tilde{x}_{x_{s},x_{d}}^{*}$ is the optimal $\tilde{x}$ for the pair $(x_{s},x_{d})$ . $\lambda(x_{s},x_{d})$ quantifies the (dis)advantage one gets by using $\mathcal{S}^{1}$ instead of $\mathcal{S}^{2}$ . Figure 6(b) plots $\lambda$ , where one sees that at worst case, we do not get any larger separation/better performance using partial measurement. But there are cases (as seen in Figure 6(a) and indicated by non zero value in (b)), where the (dis)similarity of $\tilde{x}$ is better in partial measurement.

III.2 Numerical Discussion

III.2.1 Experimental setup

Unless otherwise stated, the following setup is adopted throughout the numerical experiments. We use the QAOA embedding[19], which typically first encodes each input feature in the angle of a separate $R_{X}$ gate, then creates entanglement using a combination of parameterized one-qubit $R_{Y}$ and two-qubit $RZZ$ rotation. We use layers of size $2$ and qubits count of $4$ . We use the same basic ansatz to embed both features but allow the underlying parameters to vary independently for each feature; in other words, in (3) we take $V_{\eta}=U_{\eta}$ . Our training data comes in the form of subsets $\mathcal{E}_{l},\mathcal{D}_{k}\subset X\times\tilde{X}$ of similar and dissimilar pairs, respectively. By construction, our similarity function $\mathcal{S}_{\theta,\eta}(x,\tilde{x})$ takes values in the interval $[0,1]$ . We wish to find parameters $\theta,\eta$ such that $\mathcal{S}_{\theta,\eta}$ ideally satisfies $\mathcal{S}_{\theta,\eta}=1$ on $\mathcal{E}_{l}$ and $\mathcal{S}_{\theta,\eta}=0$ on $\mathcal{D}_{k}$ . We seek to minimize the cost function

\displaystyle L(\theta,\eta)=\sum_{(x,\tilde{x})}(\mathcal{S}_{\theta,\eta}(x,\tilde{x})-y_{x,\tilde{x}})^{2},

(18)

where

\displaystyle y_{x,\tilde{x}}=\begin{cases}1,&(x,\tilde{x})\in\mathcal{E}_{l}\\ 0,&(x,\tilde{x})\in\mathcal{D}_{k}.\end{cases}

Whereas most machine learning literature searches for the optimal parameters using some form of stochastic gradient descent, we employ the COBYLA optimizer [26]. This is a gradient-free method designed for noisy cost landscapes. Since it is expensive to evaluate the loss Equation 18 over the entire dataset, whenever it is required to evaluate the loss, we approximate it by a sum over a randomly chosen subset of the terms, normalized by the batch size of $80$ , justification for this choice is given in Appendix B.

III.2.2 Experiment/Discussion

Image data

: To better understand the capabilities of these learning models, we here show a full working flow of the algorithm. We synthetically generate $N=100$ $2\times 2$ images as shown in Figure 7(a) where the pixel values of images are filled from a uniform random distribution between $[0,1]$ . We split the image data into two sets where we reset the right half of the pixels to $0$ for one set and the left half to $0$ in the other set. We label these two subsets as “right” and “left”. This now forms the $X-$ space of our data. For $\tilde{X}$ -space, we generate $N=100$ points clustered into 2 sections (shown as red and blue in (a)) in $\mathbb{R}^{2}$ . For the training data, we consider “left”(“right”) images in $X$ to be similar to red(blue) points in $\tilde{X}$ . Using this, we then generate our training points made up of $(x,\tilde{x},y)$ for all points $x\in X$ and $\tilde{X}\in\tilde{X}$ with $y$ being ( $0$ ) $1$ for (dis)similar $(x,\tilde{x})$ pairs. We then start the learning process to find the optimal parameters ( $\theta^{*},\eta*$ ) that minimize the loss given in Equation 18. We use $\mathcal{S}^{m=2}$ as the similarity model.

Performance

: Once we have the optimal angles, there are multiple ways to test the performance of this model. First, we generate a new random “left” ( $x^{l}$ ) and “right” ( $x^{r}$ ) image and calculate the following four quantities - $\mathcal{S}(x^{l},\tilde{x}_{i}^{red})$ , $\mathcal{S}(x^{r},\tilde{x}_{i}^{red})$ , $\mathcal{S}(x^{l},\tilde{x}_{i}^{blue})$ , $\mathcal{S}(x^{r},\tilde{x}_{i}^{blue})$ where $\tilde{x}_{i}^{red/blue}$ is the $i^{th}$ red/blue point in $\tilde{X}$ -space that was used for training. We plot the histogram/density of these points in Figure 7(b), where we plot the similarity value of test image shown in inset with red points (shown as red line) and blue points (shown as blue line) for (b1) $x^{r}$ and (b2) $x^{l}$ . We see that in both cases, the network successfully separates the similarity values of how close the left/right image is to the blue/red points depicted by the peaks in each distribution. Second, we see that $x^{r}$ is closer (higher similarity value) to blue points than to red points while the case is reversed for the left value. This is consistent with how we expect the model to behave. This shows that the model has not only trivially learned what it means to be left image and what it means to be the right image, but how this feature associates to a different subspace - $\tilde{X}$ . This brings us to the discussion in subsection II.3, where we introduced the concepts of $(\epsilon,\gamma,\tau)$ -good similarity function. The figure shows that the average distance between samples in classes is well separated for their corresponding classes in $X$ space. Numerically we verified that this is not just true for a subset of points in $X$ , but for all points in $X$ .

Another gauge of the model performance is to look at where in the $\tilde{X}$ -space does our model place a given point in $X$ -space. To this end, we chose another random $x^{l/r}$ and calculated its similarity on a uniform grid space in $\tilde{X}$ . This is shown in Figure 7(c), where the color and size of each grid point corresponds to the learned similarity measure of the image in the inset. We see that the most similar points in $\tilde{X}$ for the left image (indicated by blue color) are around the lower left, while for the right, it is around the upper right corner. This is indicative of the initial mapping we learned from where the points in $\tilde{X}$ space form clusters in the left bottom, and top right parts of the space.

Generalizability of model

: To answer the question about the generalizability of the learned model, we now generate out-of-sample images given by the following $2\times 2$ pixel value

\displaystyle\begin{matrix}x_{\Delta}=\begin{array}[]{c|c}\mathit{X}_{\Delta}&1-\mathit{X}_{\Delta}\\ \hline\cr\mathit{X}_{\Delta}&1-\mathit{X}_{\Delta},\end{array}\end{matrix}

(19)

where $\mathit{X}_{\Delta}$ is a random variable generated from a truncated normal distribution $0\leq\mathcal{\tilde{N}}(\Delta,\sigma=\epsilon)\leq 1$ with mean $\Delta$ and variance $\sigma=\epsilon$ , where $\epsilon$ is a relatively small number (in our case we choose $\epsilon=0.1$ . Thus we have the two extremes $\Delta=0(1)$ , where $x_{\Delta}$ belongs to left(right) image groups while it interpolates between left and right as a function of $\Delta$ . This is shown in Figure 8(top).

We can quantify the network’s performance in this scenario by looking at the maximal separation between the similarity of $x_{\Delta}$ from the training data’s red and blue points. We do this by calculating the Wasserstein distance ( $d$ ) between the distributions formed by these similarities. For example, in Figure 7(b1), we calculate the Wasserstein distance between the red and blue distributions. We do this for $N=50$ different randomly generated $x_{\Delta}$ images for each $0\leq\Delta\leq 1$ . Figure 8(bottom) shows the result of such calculation with error bars indicating the variance of the $50$ runs. Intuitively $d(\Delta)$ measures the (in)distinguishably of $x_{\Delta}$ in $\tilde{X}$ space. Since in our mapping, left and right images are separable in $\tilde{X}$ space, we expect to see a high distance value at these extremes. As we approach $\Delta=0.5$ , the image becomes equally (dis)similar to both red and blue points in $\tilde{X}$ and hence the distance between the respective similarity distributions attain $0$ . This is shown in Figure 8(bottom). This is phonologically similar to the ordered to disordered phase transition detected using classical neural network in Ref[27]. Thus by just training the model with the extremities ( $\Delta=0/1$ ), the network has learned to detect higher-order features enabling us to detect unseen transitions. It shows that the final achieved similarity scores are discriminative not just because of local information about the images, rather a combination of both local and global properties.

IV Applications

Although similarity measures can be used for a wide range of applications, we choose three such applications to analyze and understand the technique. First, we show the trivial example of how similarity learning can be used for classification. Then, we illustrate a practical example of graph link completion problem using similarity learning. Finally, we show that ability to differentiate parameters in quantum circuits efficiently gives us the ability to use the learned model as a generative model.

IV.1 Classification

Classification tasks and the notion of similarity are deeply connected. Suppose we are given $m$ clusters of data

C_{1}=\{x_{1,i}\}_{i},\ C_{2}=\{x_{2,i}\}_{i},\dots,C_{m}=\{x_{m,i}\}_{i}\subset\mathbb{R}^{d}.

To classify a new data point $x\in\mathbb{R}^{d}$ , we want to determine the class “most similar” to $x$ - in other words, a fidelity classifier[23, Appendix C]. Here, maximizing the separation of embedded data from different classes amounts to minimizing the empirical risk of the classifier for a linear loss function. An alternative method, which we do not pursue here, is to use the trained kernel in a support vector machine classifier as proposed by Hubregtsen et al [24]. Instead, it suffices to compare $x$ with a representative sample from each class. While the usual distance in $\mathbb{R}^{d}$ provides a naive notion of similarity, distance measures tailored to the data may perform better if the classes are irregularly shaped. In Figure 9, we show our GQSim being used for $\mathcal{X}=\mathcal{\tilde{X}}=\mathbb{R}^{2}$ where (a) is multi-class classification and (b) is a more complex structure for similarity measure. In (a), we compare each point in the the space to single sample from each train class to do a one shot classification while in (b) we compute the similarity and normalize it to 1 for each point in space to get the probability of them being similar to yellow or purple family.

The image example discussed in subsection III.2 is a toy model of the following classical problem: given $m$ clusters of images and crude information about the clusters, one can use this information to encode the images in a different subspace $\tilde{X}$ . This information is termed side information, and it appears naturally in many applications. For example, when classifying images containing {house, dog, cat}, we know that even though they belong in discrete categories, the distance between categories dog-cat is closer than dog/cat-house. Another example of side information is categorizing the rating of a given movie in {1,2,3}. Even though each movie has a category, one has the information that the categories are ordered i.e $1<2<3$ . This information can be encoded in $\tilde{X}$ space by manually choosing clusters for $\{1,2,3\}\to\{c_{1},c_{2},c_{3}\}$ , say in $\mathbb{R}^{1}$ , such that center of $c_{1}<c_{2}<c_{3}$ . Thus, now learning the similarity between the movie space and $\tilde{X}$ space and using this similarity to categorize movies has more implied structure built into it than directly classifying them.

IV.2 Graph Completion

Link prediction[4] is an important aspect of network analysis and an area of key research. Graph completion problem[28] is a subset of link prediction, where it is assumed that only a small sample of a large network (e.g., a complete or partially observed sub-graph of a social graph) is observed, and one would like to infer the unobserved part of the network. To formalize the setting, we assume there is a true uni directed unweighted graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ on $n=|\mathcal{V}|$ distinguishable nodes with its own attributes $\mathbf{v}_{i}$ for $1\leq i\leq n$ with an adjacency matrix $\mathbf{A}\in\{0,1\}^{n\times n}$ where $0$ denotes an absent edge and $1$ denotes a connected edge. We then assume that only a partially observed adjacency sub-matrix $\mathbf{O}\in\{0,1\}^{m\times m}$ with $1\leq m\leq n$ induced by a sample sub-graph $\mathcal{G}^{\prime}=(\mathcal{V},\mathcal{E}^{\prime})$ with $\mathcal{E}^{\prime}\subset\mathcal{E}$ , of original graph is given. We are now interested in predicting the complete adjacency matrix $\mathbf{A}$ based on the partially observed sub-matrix $\mathbf{O}$ using the learned similarity measures of the node attributes $\mathbf{v}_{i}$ . For example, in a social community network, the nodes might represent the individuals, and the link of nodes represents the relations between the individuals. The attributes for each node could include the person’s metadata like interest, age, location, occupation, etc. The more similar attributes of the two nodes are, the higher probability they have a relationship and are linked in the network.

As an example, in Figure 10 we will associate attributes of each node with a $d=2$ dimensional data coming from $k=2$ unique distributions $P_{k}$ . Edges are then connected ( $1$ ) to nodes having attributes from the same distribution and not connected $(0)$ if their attributes correspond to a different distribution. This defines the real hidden graph $\mathbf{A}$ . After this, we randomly select $\approx 10\%$ of the edges to create the sub-graph $\mathbf{O}$ . This is then fed into the algorithm to learn the similarity between the nodes, which is then used to complete the graph. Since we have chosen an uni-directed graph $\mathcal{G}$ , connection between $(v_{i},v_{j})\in\mathcal{V}$ is going to be the same between $(v_{j},v_{i})$ . This symmetry is forced in the model by setting $f_{\theta}=g_{\eta}$ of the learning model, i.e. we choose to embed both comparing points with the same embedding. Clearly, for general directed multigraph, one needs to go beyond symmetry and have the property that $(v_{i},v_{j})\neq(v_{j},v_{i})$ . Moreover, in the case of a directed multigraph, one needs to allow for nontrivial $(v_{i},v_{i})$ to indicate loops in the graph potentially. For these cases, one may opt to use the generalized similarity learner to embed the points in different embedding and trace out parts of the Hilbert space.

The figure shows the graph nodes being plotted in a 2D space, with each node placed corresponding to its attribute. Red lines indicate the connected nodes, while dotted lines indicate no connection. Missing lines in the left figure correspond to the lack of information. The right figure shows the predicted complete matrix. As seen from the figure, the points from the same spatial cluster are connected while the nodes with attributes belonging to different $P_{k}$ are disconnected. In the experiments where we have $P_{k}$ as Gaussian with random spread $\in[0.5,1.5]$ , a reduced graph with edges as little as $1\%$ reproduces the final graph with $100\%$ accuracy.

IV.3 Generative model

Given a learned similarity measure between two different spaces, one can now use support points to generate new data in the other space. More concretely, given a learned similarity measure $\mathcal{S}$ , and a support point $x_{s}\in X$ , one can generate an unseen data point $\tilde{x}_{s}$ in $\tilde{X}$ using the following optimization problem

\displaystyle\min_{\tilde{x}}1-\mathcal{S}(x_{s},\tilde{x}).

(20)

Such tasks occur naturally in many scenarios; for example, in the case of language translation between two languages, one can generate new unseen similar sentences in another language based on learned similarity measure. Despite being an optimization problem in feature space, this is efficient in our quantum case. This is because features are directly encoded as parameters in our PQCs and thus can be efficiently differentiated[29]. To illustrate this, we again invoke the example discussed in subsection III.2. Given an image, we solve Equation 20 to find the closest point in $\mathbb{R}^{2}$ for a given image of type “right”. We show this in Figure 11, where we plot the cost function of this minimization problem and the minimization steps.

V Summary

We have considered a generalization of similarity learning techniques in the quantum setting. PQCs express similarity functions with a richer set of properties like (a)symmetry, intransitivity, and even multi-space metrics. Similarity learning boils down to learning pair wise similarity by embedding the input features in Hilbert space. We illustrate the effect of using partial measurements and their use in modeling imbalanced data. Using a synthetic image data set with left and right pixels blocked, we learn the similarity of these images $w.r.t$ arbitrary points in $\mathbb{R}^{2}$ space. This example is also used to illustrate the generalizing capability of these learned models. We use the model to detect phase transition from left-like image to right-like image. Finally, we show three applied use cases where learned similarity can be used. Classification, being a trivial use case of learning similarity, can be augmented to encode given side-information about the data using the multi-space property of generalized similarity learning. Graph link completion problem can be rephrased as a similarity learning problem, which we numerically showed for a simple example. Finally, we demonstrate how similarity models can be used as generative models to generate unseen data in the complementary space, given corresponding data in the original space. Nevertheless, there are still critical challenges regarding the choice of the PQC embedding one chooses for the given data.

V.1 Acknowledgments

The authors thank Jack Baker for insightful discussions.

Appendix A Analytical details of toy-problem

We here outline the details of setup discussed in subsection III.1. As discussed, one can think of the PQC in Figure 5 (a1/a2) in two equivalent ways - 1. $U(x)=R_{y}(x,0);V(\tilde{x})=CNOT(0,1)R_{x}(\tilde{x},1)$ (or move the $CNOT$ to $U$ ) or 2. $\tilde{U}(x,\tilde{x})=R_{y}(x,0)CNOT(0,1)R_{x}(\tilde{x},1)$ . First one corresponds to individual mappings while the second one takes the pair values and map it to a point in Hilbert space, both of which are equivalent views. The matrix form of $\tilde{U}$ is given by

\displaystyle\tilde{U}(x,\tilde{x})=\left[\begin{matrix}\cos{\left(\frac{x_{1}}{2}\right)}\cos{\left(\frac{x_{2}}{2}\right)}&i\sin{\left(\frac{x_{1}}{2}\right)}\sin{\left(\frac{x_{2}}{2}\right)}&-i\sin{\left(\frac{x_{2}}{2}\right)}\cos{\left(\frac{x_{1}}{2}\right)}&-\sin{\left(\frac{x_{1}}{2}\right)}\cos{\left(\frac{x_{2}}{2}\right)}\\ i\sin{\left(\frac{x_{1}}{2}\right)}\sin{\left(\frac{x_{2}}{2}\right)}&\cos{\left(\frac{x_{1}}{2}\right)}\cos{\left(\frac{x_{2}}{2}\right)}&-\sin{\left(\frac{x_{1}}{2}\right)}\cos{\left(\frac{x_{2}}{2}\right)}&-i\sin{\left(\frac{x_{2}}{2}\right)}\cos{\left(\frac{x_{1}}{2}\right)}\\ \sin{\left(\frac{x_{1}}{2}\right)}\cos{\left(\frac{x_{2}}{2}\right)}&-i\sin{\left(\frac{x_{2}}{2}\right)}\cos{\left(\frac{x_{1}}{2}\right)}&-i\sin{\left(\frac{x_{1}}{2}\right)}\sin{\left(\frac{x_{2}}{2}\right)}&\cos{\left(\frac{x_{1}}{2}\right)}\cos{\left(\frac{x_{2}}{2}\right)}\\ -i\sin{\left(\frac{x_{2}}{2}\right)}\cos{\left(\frac{x_{1}}{2}\right)}&\sin{\left(\frac{x_{1}}{2}\right)}\cos{\left(\frac{x_{2}}{2}\right)}&\cos{\left(\frac{x_{1}}{2}\right)}\cos{\left(\frac{x_{2}}{2}\right)}&-i\sin{\left(\frac{x_{1}}{2}\right)}\sin{\left(\frac{x_{2}}{2}\right)}\end{matrix}\right],

(21)

from which we see that

\displaystyle\absolutevalue{\expectationvalue{\tilde{U}(x,\tilde{x})}{00}}^{2}=\cos^{2}{\left(\frac{x_{1}}{2}\right)}\cos^{2}{\left(\frac{x_{2}}{2}\right)},

(22)

which is given in Equation 12. Similarly, we get Equation 13 by computing

\displaystyle\sum_{i\in\{0,1\}}\absolutevalue{\expectationvalue{\tilde{U}(x,\tilde{x})}{0i}}^{2}=\frac{\cos{\left(x-\tilde{x}\right)}+\cos{\left(x+\tilde{x}\right)}}{4}+\frac{1}{2},

(23)

which trivially from Equation 21.

Appendix B Optimization for learning

Throughout the paper, as mentioned in subsubsection III.2.1, we have used the COBYLA[26] optimization scheme for learning the similarity. Because of the costly function evaluations required to compute pair-wise terms in Equation 18, we apply the heuristics of stochastic batching to reduce the number of evaluations. The heuristics are as follows:

•

We pick $N^{b}_{i}/2$ random pairs of points from each similar/dissimilar pairs from the training set for $i^{th}$ iteration making the set $\mathcal{X}_{N}^{i}=X_{N^{b}_{i}}\times\tilde{X}_{N^{b}_{i}}$ .
•

We compute $L_{i}(\theta,\eta)=\frac{1}{N^{b}_{i}}\sum_{(x,\tilde{x})\in\mathcal{X}_{N}^{i}}(\mathcal{S}_{\theta,\eta}(x,\tilde{x})-y_{x,\tilde{x}})^{2}$
•

We use this $L_{i}$ in the COBYLA optimization routine until convergence is achieved.

We caution that this “stochastic” form of COBYLA is similar to stochastic gradient descent techniques[30], but has no theoretical backing yet in the literature. Having said that, this heuristic, although employed for reducing the computational time, could offer potential benefits that exist in stochastic gradient descent. Primarily, the stochastic batching heuristic could be used to reduce the number of function evaluations required for learning and push towards the global minimum much faster than non-stochastic, where we have lots of local maxima/minima. One can think of this as approximating the ”effective” manifold of the cost function having an error bar. To this end, it is essential to understand what numerical choice of $N^{b}_{i}$ is best. Naively one could assume that $N^{b}_{i}$ is indeed data-dependent as the features in the data could make the landscape more complicated. To understand this better, we calculate the effect of $N^{b}_{i}$ on the cost function by computing the average of $L_{i}$ over a range of $N^{b}_{i}$ values for various runs. We then plot the average cost function over the range of $N^{b}_{i}$ for two different similarity learning problems - Classifying two clustered blob data set and two moon data set, which is shown in the Figure 12.

We see that, at least in the case of blob vs. moon, where one would assume learning blob is more straightforward than learning similarity between moon, the batch size dependence on the variance of cost manifold is almost the same. We also observe that around batch size $=10^{2}$ , the variance in cost manifold $\approx 10^{-2}$ for both blob and moon data set. This is because the cost function is very smooth, and the batch size is big enough to approximate the exact cost manifold, despite the actual batch size cost manifold having $\approx 1500$ pairs. We thus use a batch size of $\approx 81$ throughout the numerical experiments in the paper.

Appendix C The effect of partial measurements

We present some preliminary data on the effect of measuring only some qubits in the image data experiment of subsubsection III.2.1. For each $k=1,2,3,4$ , generate two separated clusters, labeled “red” and “blue”, of points in the $k$ -dimensional unit cube $\tilde{X}=[0,1]^{k}$ ; the preceding discussion considered the case $k=2$ . As before we try to learn the associations (“Right”, {blue cluster}), (“Right”, not {red cluster}), (“Left”, {red cluster}), (“Left”, not {blue cluster}). We then run $100$ instances of randomly generated synthetic data of this form and run the learning algorithm for all the allowed trace qubit dimension. Figure 13 shows the loss as a function of training iterations for each dimension of $\tilde{X}$ .

Next we run the training multiple times for each dimension of $\tilde{X}$ and for each choice of qubits to be measured. The average (mean) minimum training losses encountered during each training is shown in Figure 14, together with their variance. Overall there appears to be a slight degradation in training quality as more qubits are measured. Note that for a given problem (denoted by color in plot ), all the trace experiments are done with the same data-set. We see that smaller Hilbert space seem to attain lower average cost value compared to measuring all qubits. We see that the variance in the cost drastically decreases as well. Both these phenomenon point us in future directions to explore for further understanding the effects of partial measurements in learning models.

References

Schultz and Joachims [2004] M. Schultz and T. Joachims, Learning a distance metric from relative comparisons, Advances in neural information processing systems 16, 41 (2004).
Shalev-Shwartz et al. [2004] S. Shalev-Shwartz, Y. Singer, and A. Y. Ng, Online and batch learning of pseudo-metrics, in Proceedings of the twenty-first international conference on Machine learning (2004) p. 94.
Chechik et al. [2009] G. Chechik, V. Sharma, U. Shalit, and S. Bengio, An online algorithm for large scale image similarity learning, (2009).
Bellet et al. [2013] A. Bellet, A. Habrard, and M. Sebban, A survey on metric learning for feature vectors and structured data, arXiv preprint arXiv:1306.6709 (2013).
Nicolae et al. [2015] M.-I. Nicolae, É. Gaussier, A. Habrard, and M. Sebban, Joint semi-supervised similarity learning for linear classification, in Joint European Conference on Machine Learning and Knowledge Discovery in Databases (Springer, 2015) pp. 594–609.
Bengio et al. [2003] Y. Bengio, J.-f. Paiement, P. Vincent, O. Delalleau, N. Roux, and M. Ouimet, Out-of-sample extensions for lle, isomap, mds, eigenmaps, and spectral clustering, Advances in neural information processing systems 16, 177 (2003).
Roweis and Saul [2000] S. T. Roweis and L. K. Saul, Nonlinear dimensionality reduction by locally linear embedding, science 290, 2323 (2000).
Cox and Cox [2008] M. A. Cox and T. F. Cox, Multidimensional scaling, in Handbook of data visualization (Springer, 2008) pp. 315–347.
Belkin and Niyogi [2003] M. Belkin and P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural computation 15, 1373 (2003).
Kerenidis et al. [2019] I. Kerenidis, J. Landman, and A. Prakash, Quantum algorithms for deep convolutional neural networks (2019), arXiv:1911.01117 [quant-ph] .
Dallaire-Demers and Killoran [2018] P.-L. Dallaire-Demers and N. Killoran, Quantum generative adversarial networks, Physical Review A 98, 012324 (2018).
Radha [2021] S. K. Radha, Quantum constraint learning for quantum approximate optimization algorithm (2021), arXiv:2105.06770 [quant-ph] .
Coyle et al. [2021] B. Coyle, M. Henderson, J. C. J. Le, N. Kumar, M. Paini, and E. Kashefi, Quantum versus classical generative modelling in finance, Quantum Science and Technology 6, 024013 (2021).
Liu et al. [2021] Y. Liu, S. Arunachalam, and K. Temme, A rigorous and robust quantum speed-up in supervised machine learning, Nature Physics 17, 1013 (2021).
Sweke et al. [2021] R. Sweke, J.-P. Seifert, D. Hangleiter, and J. Eisert, On the quantum versus classical learnability of discrete distributions, Quantum 5, 417 (2021).
Huang et al. [2021] H.-Y. Huang, M. Broughton, M. Mohseni, R. Babbush, S. Boixo, H. Neven, and J. R. McClean, Power of data in quantum machine learning, Nature communications 12, 1 (2021).
Schuld [2021] M. Schuld, Quantum machine learning models are kernel methods, arXiv e-prints , arXiv (2021).
Pronobis and Müller [2020] W. Pronobis and K.-R. Müller, Kernel methods for quantum chemistry, in Machine Learning Meets Quantum Physics (Springer, 2020) pp. 25–36.
Lloyd et al. [2020a] S. Lloyd, M. Schuld, A. Ijaz, J. Izaac, and N. Killoran, Quantum embeddings for machine learning (2020a), arXiv:2001.03622 [quant-ph] .
Bromley et al. [1993] J. Bromley, J. W. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Säckinger, and R. Shah, Signature verification using a “siamese” time delay neural network, International Journal of Pattern Recognition and Artificial Intelligence 7, 669 (1993).
Havlíček et al. [2019] V. Havlíček, A. D. Córcoles, K. Temme, A. W. Harrow, A. Kandala, J. M. Chow, and J. M. Gambetta, Supervised learning with quantum-enhanced feature spaces, Nature 567, 209 (2019).
Balcan et al. [2008] M.-F. Balcan, A. Blum, and N. Srebro, Improved guarantees for learning via similarity functions, (2008).
Lloyd et al. [2020b] S. Lloyd, M. Schuld, A. Ijaz, J. Izaac, and N. Killoran, Quantum embeddings for machine learning, (2020b), arXiv:2001.03622 [quant-ph] .
Hubregtsen et al. [2021] T. Hubregtsen, D. Wierichs, E. Gil-Fuster, P.-J. H. S. Derks, P. K. Faehrmann, and J. J. Meyer, Training quantum embedding kernels on near-term quantum computers, (2021), arXiv:2105.02276 [quant-ph] .
Lloyd et al. [2013] S. Lloyd, M. Mohseni, and P. Rebentrost, Quantum algorithms for supervised and unsupervised machine learning (2013), arXiv:1307.0411 [quant-ph] .
Powell [2007] M. J. Powell, A view of algorithms for optimization without derivatives, Mathematics Today-Bulletin of the Institute of Mathematics and its Applications 43, 170 (2007).
Shiina et al. [2020] K. Shiina, H. Mori, Y. Okabe, and H. K. Lee, Machine-learning studies on spin models, Scientific reports 10, 1 (2020).
Bai et al. [2019] L. Bai, L. Rossi, L. Cui, J. Cheng, and E. R. Hancock, A quantum-inspired similarity measure for the analysis of complete weighted graphs, IEEE transactions on cybernetics 50, 1264 (2019).
Benedetti et al. [2019] M. Benedetti, E. Lloyd, S. Sack, and M. Fiorentini, Parameterized quantum circuits as machine learning models, Quantum Science and Technology 4, 043001 (2019).
Bottou [2010] L. Bottou, Large-scale machine learning with stochastic gradient descent, in Proceedings of COMPSTAT’2010 (Springer, 2010) pp. 177–186.