
Learning from Few Samples: Transformation-Invariant SVMs with Composition and Locality at Multiple Scales

Tao Liu1, P. R. Kumar1, Ruida Zhou1, Xi Liu2
1Texas A&M University, College Station, TX, USA
2Applied Machine Learning, Facebook AI, Menlo Park, CA, USA
{tliu, prk, ruida}@tamu.edu, [email protected]
Abstract

Motivated by the problem of learning with small sample sizes, this paper shows how to incorporate into support-vector machines (SVMs) those properties that have made convolutional neural networks (CNNs) successful. Particularly important is the ability to incorporate domain knowledge of invariances, e.g., translational invariance of images. Kernels based on the maximum similarity over a group of transformations are not generally positive definite. Perhaps it is for this reason that they have not been studied theoretically. We address this lacuna and show that positive definiteness indeed holds with high probability for kernels based on the maximum similarity in the small training sample set regime of interest, and that they do yield the best results in that regime. We also show how additional properties such as their ability to incorporate local features at multiple spatial scales, e.g., as done in CNNs through max pooling, and to provide the benefits of composition through the architecture of multiple layers, can also be embedded into SVMs. We verify through experiments on widely available image sets that the resulting SVMs do provide superior accuracy in comparison to well-established deep neural network benchmarks for small sample sizes.

1 Introduction

With the goal of learning when the number of training samples is small, and motivated by the success of CNNs [35, 31], we wish to endow SVMs with as much a priori domain knowledge as possible.

One such important domain property for image recognition is translational invariance. An image of a dog remains an image of the same dog if the image is shifted to the left. Similarly, it is also rotation-invariant. More generally, given a group of transformations under which the classification of images is invariant, we show how to endow SVMs with the knowledge of such invariance.

One common approach is data augmentation [29, 6], where several transformations of each training sample are added to the training set. When applying SVMs on this augmented training set, it corresponds to a kernel that defines the similarity between two vectors X1X_{1} and X2X_{2} as the average similarity between X1X_{1} and all transformations of X2X_{2}. However the average also includes transformations that are maximally dissimilar, and we show that it leads to poor margins and poor classification results. More appealing is to construct a kernel that defines similarity as the maximum similarity between X1X_{1} and all transformations of X2X_{2}. We show that this kernel is positive definite with high probability in the small sample size regime of interest to us, under a probabilistic model for features. We verify this property on widely available datasets and show that the improvement obtained by endowing SVMs with this transformation invariance yields considerably better test accuracy.

Another important domain property for image recognition is the “locality" of features, e.g., an edge depends only on a sharp gradient between neighboring pixels. Through operations such as max-pooling, CNNs exploit locality at multiple spatial scales. We show how one may incorporate such locality into polynomial SVMs.

Finally, their multi-layer architecture provides CNNs the benefits of composition [7]. We show how one can iteratively introduce multiple layers into SVMs to facilitate composition. The introduction of multiple layers increases computational complexity, and we show how this can be alleviated by parallel computation so as to achieve a reduction of computation time by increasing memory.

We show experimentally that the resulting SVMs provide significantly improved performance for small datasets. Translational and rotational invariance embedded into SVMs allows them to recognize objects that have not already been centered in images or oriented in upright positions; referred to as transformed datasets in the sequel. The transformation-invariant SVMs provide significant improvements over SVMs as well as CNN benchmarks without data augmentation when the training set is small. For 100/200/500 training samples, the recognition accuracy of the MNIST dataset [17] is increased, respectively, from the figures of 68.33%/83.20%/91.33% reported by the CNNs optimized over architectures and dimensions [10] to, respectively, 81.55%/89.23%/93.11%. Similar improvements are also obtained in the EMNIST Letters [4] and Transformed MNIST datasets. Computational results reported here are for small datasets, handled efficiently by LIBSVM [2].

1.1 Background

In the early 2000s, SVMs [5] were one of the most effective methods for image classification [19]. They had a firm theoretical foundation of margin maximization [32, 1] that is especially important in high dimensions, and a kernel method to make high-dimensional computation tractable. However, with the advent of CNNs and the enhanced computing power of GPUs, SVMs were replaced in image classification. One reason was that CNNs were able to incorporate prior knowledge. The pioneering paper [35] emphasizes that “It is usually accepted that good generalization performance on real-world problems cannot be achieved unless some a priori knowledge about the task is built into the system. Back-propagation networks provide a way of specifying such knowledge by imposing constraints both on the architecture of the network and on its weights. In general, such constraints can be considered as particular transformations of the parameter space." It further mentions specifically that, “Multilayer constrained networks perform very well on this task when organized in a hierarchical structure with shift-invariant feature detectors." Indeed, CNNs have successfully incorporated several important characteristics of images. One, mentioned above (called shift-invariance), is translational invariance, which is exploited by the constancy, i.e., location independence, of the convolution matrices. A second is locality. For example, an “edge" in an image can be recognized from just the neighboring pixels. This is exploited by the low dimensionality of the kernel matrix. A third characteristic is the multiplicity of spatial scales, i.e., a hierarchy of spatial “features" of multiple sizes in images. These three properties are captured in modern CNNs through the “pooling" operation at the (+1)(\ell+1)-th layer, where the features of the \ell-th layer are effectively low-pass filtered through operations such as max-pooling. More recently, it has been shown that depth in neural networks (NNs) of rectified linear units (ReLUs) permits composition, enhances expressivity for a given number of parameters, and reduces the number needed for accurate approximations [7].

Generally, neural networks have become larger over time with the number of parameters ranging into hundreds of millions, concomitantly also data-hungry, and therefore inapplicable in applications where data is expensive or scarce, e.g., medical data [20]. For these reasons, there is an interest in methodologies for learning efficiently from very few samples, which is the focus of this paper.

1.2 Related Work

There are mainly two sets of studies: on transformation-invariant kernels and on local correlations.

Transformation-Invariant Kernels. There are two major routes to explore transformation invariant kernels [26, 16]. The most widely used is based on Haar-integration kernels, called “average-fit” kernels in this paper, which average over a transformation group [27, 12, 11, 23, 21, 22]. Although trivially positive definite, they appear to produce poor results in our testing (Section 4). Mairal et al. [21] reported some good results when a large dataset is available for pre-training bottom network layers unsupervised [24], but our focus is on data-scarce situations where no such large dataset is available. This paper concentrates on “best-fit” kernels, which are based on the maximum similarity over transformations. “Jittering kernels” [8, 9] calculate the minimum distance between one sample and all jittered forms of another sample, analogous to this paper. Instead of measuring the minimum distance between samples, tangent distance kernels [30, 13] use a linear approximation to measure the minimum distance between sets generated by the transformation group, which can only guarantee local invariance. In comparison, our “best-fit” kernel is defined as the maximum inner product between one sample and all transformed forms of another sample, which enjoys global invariance and differs from jittering kernels for non-norm-preserving transformations (e.g., scaling transformations). Although “best-fit” kernels enjoy a good performance, they are not guaranteed to be positive definite, thus the global optimality of the SVM solution cannot be guaranteed theoretically. We address this lacuna and show that positive definiteness indeed holds with high probability in the small training sample set regime for the “best-fit” kernel. Additionally, we show that locality at multiple scales can further improve the performance of “best-fit” kernels. We note that there were also some works that treat an indefinite kernel matrix as a noisy observation of some unknown positive semidefinite matrix [3, 36], but our goal is to analyze conditions that make the “best-fit” kernel positive definite.

Local Correlations. Since “local correlations," i.e., dependencies between nearby pixels, are more pronounced than long-range correlations in natural images, Scholkopf et al. [25] defined a two-layer kernel utilizing dot product in a polynomial space which is mainly spanned by local correlations between pixels. We extend the structure of two-layer local correlation to multilayer architectures by introducing further compositions, which gives the flexibility to consider the locality at multiple spatial scales. We also analyze the corresponding time and space complexity of multilayer architectures.

2 Kernels with Transformational Invariance

To fix notation, let 𝒮={(X1,y1),,(Xn,yn)}\mathcal{S}=\{(X_{1},y_{1}),\dots,(X_{n},y_{n})\} be a set of nn labeled samples with mm-dimensional feature vectors Xi=(Xi(1),,Xi(m))T𝒳mX_{i}=(X^{(1)}_{i},\ldots,X^{(m)}_{i})^{T}\in\mathcal{X}\subset\mathbb{R}^{m} and labels yi𝒴\ y_{i}\in\mathcal{Y}.

2.1 Transformation Groups and Transformation-Invariant Best-Fit Kernels

We wish to endow the kernel of the SVM with the domain knowledge that the sample classification is invariant under certain transformations of the sample vectors. Let 𝒢{\cal{G}} be a transformation group that acts on 𝒳\mathcal{X}, i.e., for all S,T,U𝒢S,T,U\in{\cal{G}}: (i) TT maps 𝒳\mathcal{X} into 𝒳\mathcal{X}; (ii) the identity map I𝒢I\in{\cal{G}}; (iii) ST𝒢ST\in{\cal{G}}; (iv) (ST)U=S(TU)(ST)U=S(TU); (v) there is an inverse T1𝒢T^{-1}\in{\cal{G}} with TT1=T1T=ITT^{-1}=T^{-1}T=I.

We start with a base kernel Kbase:𝒳×𝒳K_{base}:\mathcal{X}\times\mathcal{X}\to\mathcal{R} that satisfies the following properties:

  1. Symmetry, i.e., Kbase(Xi,Xj)=Kbase(Xj,Xi)K_{base}(X_{i},X_{j})=K_{base}(X_{j},X_{i});

  2. Positive definiteness, i.e., positive semi-definiteness of its Gram matrix: i=1nj=1nαiαjKbase(Xi,Xj)0\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_{i}\alpha_{j}K_{base}(X_{i},X_{j})\geq 0 for all α1,,αn\alpha_{1},\dots,\alpha_{n}\in\mathbb{R};

  3. Kbase(TXi,Xj)=Kbase(Xi,T1Xj),T𝒢K_{base}(TX_{i},X_{j})=K_{base}(X_{i},T^{-1}X_{j}),\forall T\in{\cal{G}}.

Define the kernel with the “best-fit" transformation over 𝒢{\cal{G}} by

K𝒢,best,base(Xi,Xj):=supT𝒢Kbase(TXi,Xj).\displaystyle K_{{\cal{G}},best,base}(X_{i},X_{j}):=\sup_{T\in{\cal{G}}}K_{base}(TX_{i},X_{j}). (1)
Lemma 1.

K𝒢,best,baseK_{{\cal{G}},best,base} is a symmetric kernel that is also transformation-invariant over the group 𝒢{\cal{G}}, i.e., K𝒢,best,base(TXi,Xj)K_{{\cal{G}},best,base}(TX_{i},X_{j}) = K𝒢,best,base(Xi,Xj)K_{{\cal{G}},best,base}(X_{i},X_{j}).

Proof. The symmetry follows since

K_{{\cal{G}},best,base}(X_{i},X_{j})=\sup_{T\in{\cal{G}}}K_{base}(TX_{i},X_{j})=\sup_{T\in{\cal{G}}}K_{base}(X_{i},T^{-1}X_{j})=\sup_{T\in{\cal{G}}}K_{base}(T^{-1}X_{j},X_{i})=\sup_{T^{-1}\in{\cal{G}}}K_{base}(T^{-1}X_{j},X_{i})=K_{{\cal{G}},best,base}(X_{j},X_{i}).

The transformational invariance follows since

K_{{\cal{G}},best,base}(TX_{i},X_{j})=\sup_{S\in{\cal{G}}}K_{base}(STX_{i},X_{j})=\sup_{U:\,UT^{-1}\in{\cal{G}}}K_{base}(UX_{i},X_{j})=\sup_{U\in{\cal{G}}}K_{base}(UX_{i},X_{j})=K_{{\cal{G}},best,base}(X_{i},X_{j}). ∎

The Translation Group: Of particular interest in image classification is the group of translations, a subgroup of the group of transformations. Let Xi={Xi(p,q):p[m1],q[m2]}X_{i}=\{X_{i}^{(p,q)}:p\in[m_{1}],q\in[m_{2}]\} denote a two-dimensional m1×m2m_{1}\times m_{2} array of pixels, with m=m1m2m=m_{1}m_{2}. Let TrsXi:={Xi((pr)modm1,(qs)modm2):p[m1],q[m2]}T_{rs}X_{i}:=\{X_{i}^{((p-r)\operatorname{mod}m_{1},(q-s)\operatorname{mod}m_{2})}:p\in[m_{1}],q\in[m_{2}]\} denote the transformation that translates the array by rr pixels in the xx-direction, and by ss pixels in the yy-direction. The translation group is 𝒢trans:={Trs:r[m1],s[m2]}{\cal{G}}_{trans}:=\{T_{rs}:r\in[m_{1}],s\in[m_{2}]\}. For notational simplicity, we will denote the resulting kernel K𝒢trans,best,baseK_{{\cal{G}}_{trans},best,base} by KTI,best,baseK_{TI,best,base}.
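For concreteness, the following minimal NumPy sketch evaluates the best-fit translation-invariant linear kernel for two-dimensional arrays by enumerating all cyclic translations with np.roll; it is only an illustration (an optimized implementation would use FFT-based cross-correlation), and it is not the authors' released code.

import numpy as np

def ti_best_linear_kernel(Xi, Xj):
    # Best-fit translation-invariant linear kernel: sup over all cyclic
    # translations T of (1/m) <T Xi, Xj>, for Xi, Xj of shape (m1, m2).
    m1, m2 = Xi.shape
    m = m1 * m2
    best = -np.inf
    for r in range(m1):
        for s in range(m2):
            shifted = np.roll(np.roll(Xi, r, axis=0), s, axis=1)  # T_{rs} Xi
            best = max(best, float(np.sum(shifted * Xj)) / m)
    return best

# Usage: the value is unchanged when the first argument is translated (Lemma 1).
rng = np.random.default_rng(0)
A, B = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
print(ti_best_linear_kernel(A, B))
print(ti_best_linear_kernel(np.roll(A, 3, axis=0), B))  # identical value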

2.2 Positive Definiteness of Translation-Invariant Best-Fit Kernels

There are two criteria that need to be met when trying to embed transformational invariance into SVM kernels. (i) The kernel will need to be invariant with respect to the particular transformations of interest in the application domain. (ii) The kernel will need to be positive definite to have provable guarantees of performance. KTI,best,baseK_{TI,best,base} satisfies property (i) as established in Lemma 1. Concerning property (ii) though, in general, KTI,best,baseK_{TI,best,base} is an indefinite kernel. We now show that when the base kernel is the normalized linear kernel Klinear(Xi,Xj):=1mXiTXjK_{linear}(X_{i},X_{j}):=\frac{1}{m}X_{i}^{T}X_{j}, the resulting kernel KTI,best,linearK_{TI,best,linear} is indeed positive definite in the small sample regime of interest, under a probabilistic model that allows dependent features.

Assumption 1.

Suppose that nn samples {X1,X2,,Xn}\{X_{1},X_{2},\dots,X_{n}\} are i.i.d., with Xi={Xi(1),Xi(2),,Xi(m)}X_{i}=\{X_{i}^{(1)},X_{i}^{(2)},\dots,X_{i}^{(m)}\} being a normal random vector with Xi𝒩(0,Σ)X_{i}\sim\mathcal{N}(0,\Sigma), i[n]\forall i\in[n]. Suppose also that λ2/λ(lnm)1+ϵ2/2\|\lambda\|_{2}/\|\lambda\|_{\infty}\geq(\ln m)^{\frac{1+\epsilon}{2}}/2 for some ϵ(0,1]\epsilon\in(0,1], where λ:=(λ(1),,λ(m))\lambda:=(\lambda^{(1)},\dots,\lambda^{(m)}) is comprised of the eigenvalues of Σ\Sigma. Note that Xi(p)X_{i}^{(p)} may be correlated with Xi(q)X_{i}^{(q)}, for pqp\neq q.

Example 1.

When Σ=Im\Sigma=I_{m}, i.e., Xi(p)𝒩(0,1),p[m]X_{i}^{(p)}\sim\mathcal{N}(0,1),\forall p\in[m], the condition λ2/λ(lnm)1+ϵ2/2\|\lambda\|_{2}/\|\lambda\|_{\infty}\geq(\ln m)^{\frac{1+\epsilon}{2}}/2 holds trivially since λ2=m\|\lambda\|_{2}=\sqrt{m} and λ=1\|\lambda\|_{\infty}=1. We note that for independent features, we can relax the normality (Assumption 1) to sub-Gaussianity.

Theorem 1.

Let

KTI,best,linear(Xi,Xj)\displaystyle K_{TI,best,linear}(X_{i},X_{j}) :=supT𝒢trans1m(TXi)TXj\displaystyle:=\sup_{T\in{\cal{G}}_{trans}}\frac{1}{m}(TX_{i})^{T}X_{j} (2)

be the best-fit translation invariant kernel with the base kernel chosen as the normalized linear kernel. Under Assumption 1, if nλ12λ2(lnm)1+ϵ2n\leq\frac{\|\lambda\|_{1}}{2\|\lambda\|_{2}(\ln m)^{\frac{1+\epsilon}{2}}}, then KTI,best,linearK_{TI,best,linear} is a positive definite kernel with probability approaching one, as mm\to\infty.

Outline of proof.

We briefly explain ideas behind the proof, with details presented in Appendix B. For brevity, we denote KTI,best,linear(Xi,Xj)K_{TI,best,linear}(X_{i},X_{j}) and Klinear(Xi,Xj)K_{linear}(X_{i},X_{j}) by KTI,ijK_{TI,ij} and KijK_{ij}, respectively. From Gershgorin’s circle theorem [33], every eigenvalue of KTI,best,linearK_{TI,best,linear} lies within at least one of the Gershgorin discs 𝒟(KTI,ii,ri):={λ||λKTI,ii|ri}\mathcal{D}(K_{TI,ii},r_{i}):=\{\lambda\in\mathbb{R}\ |\ |\lambda-K_{TI,ii}|\leq r_{i}\}, where ri:=ji|KTI,ij|r_{i}:=\sum_{j\neq i}|K_{TI,ij}|. Hence if KTI,ii>ji|KTI,ij|,iK_{TI,ii}>\sum_{j\neq i}|K_{TI,ij}|,\ \forall i, then KTI,best,linearK_{TI,best,linear} is a positive definite kernel. Accordingly, we derive a tail bound on ji|KTI,ij|\sum_{j\neq i}|K_{TI,ij}| and a lower bound on KTI,iiK_{TI,ii}, respectively, to complete the proof. ∎

Note that λ1mλ2\|\lambda\|_{1}\leq\sqrt{m}\|\lambda\|_{2} for an mm-length vector λ\lambda, which implies that n=O~(m)n=\tilde{O}(\sqrt{m}).
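As a concrete instance (a back-of-the-envelope calculation of ours, not taken from the experiments), in the identity-covariance setting of Example 1 with MNIST-sized images, m = 28 × 28 = 784 and ϵ\epsilon close to 0, the bound becomes

n\leq\frac{\|\lambda\|_{1}}{2\|\lambda\|_{2}(\ln m)^{\frac{1+\epsilon}{2}}}=\frac{\sqrt{m}}{2(\ln m)^{\frac{1+\epsilon}{2}}}\approx\frac{28}{2\sqrt{\ln 784}}\approx 5,

which is far below the empirical thresholds reported in Table 1, consistent with the Remark below that the Gershgorin-based bound is conservative.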

We now show that positive definiteness in the small sample regime also holds for the polynomial kernels which are of importance in practice:

Kpoly(Xi,Xj):=(1+γmXiTXj)d for γ0 and d.\displaystyle K_{poly}(X_{i},X_{j}):=(1+\frac{\gamma}{m}X_{i}^{T}X_{j})^{d}\mbox{ for }\gamma\geq 0\mbox{ and }d\in\mathbb{N}.
Theorem 2.

For any γ+\gamma\in\mathbb{R}_{+} and dd\in\mathbb{N}, the translation-invariant kernels,

KTI,best,poly(Xi,Xj)\displaystyle K_{TI,best,poly}(X_{i},X_{j}) :=supT𝒢trans(1+γm(TXi)TXj)d\displaystyle:=\sup_{T\in{\cal{G}}_{trans}}(1+\frac{\gamma}{m}(TX_{i})^{T}X_{j})^{d} (3)

are positive definite with probability approaching one as m+m\to+\infty, under Assumption 1, when λ1\|\lambda\|_{\infty}\leq 1, and nλ12λ2(lnm)1+ϵ2n\leq\frac{\|\lambda\|_{1}}{2\|\lambda\|_{2}(\ln m)^{\frac{1+\epsilon}{2}}}.

Proof.

Note that λ1\|\lambda\|_{\infty}\leq 1 can be satisfied simply by dividing all entries of the XiX_{i}s by λ\sqrt{\|\lambda\|_{\infty}}. The detailed proof is presented in Appendix C. ∎

Remark. Since Gershgorin’s circle theorem is a conservative bound on the eigenvalues of a matrix, the bound n=O~(m)n=\tilde{O}(\sqrt{m}) on the number of samples for positive definiteness is also conservative. In practice, positive definiteness of KTI,best,linearK_{TI,best,linear} holds for larger nn. Even more usefully, KTI,best,polyK_{TI,best,poly} is positive definite for a much larger range of nn than KTI,best,linearK_{TI,best,linear}, as reported in Table 1.

Table 1: The value of nn up to which the kernel is positive definite. Positive definiteness continues to hold for moderate sample sizes, indicating that the theorem is conservative.
Datasets KTI,best,linearK_{TI,best,linear} KTI,best,polyK_{TI,best,poly}
Original MNIST 45\approx 45 375\approx 375
EMNIST 35\approx 35 395\approx 395
Translated MNIST 455\approx 455 15000\approx 15000
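The thresholds in Table 1 can be estimated empirically by computing the smallest eigenvalue of the Gram matrix as n grows; the following sketch shows one way to do so (it assumes a collection of images of shape (m1, m2) is available and uses illustrative parameters; it is not the exact script used to produce the table).

import numpy as np

def ti_best_poly_kernel(Xi, Xj, gamma=1.0, degree=8):
    # Best-fit translation-invariant polynomial kernel of Eq. (3).
    m1, m2 = Xi.shape
    m = m1 * m2
    best = -np.inf
    for r in range(m1):
        for s in range(m2):
            shifted = np.roll(np.roll(Xi, r, axis=0), s, axis=1)
            best = max(best, (1.0 + gamma * float(np.sum(shifted * Xj)) / m) ** degree)
    return best

def min_gram_eigenvalue(samples, kernel):
    # Smallest eigenvalue of the Gram matrix; a nonnegative value indicates
    # positive definiteness in the sense used in this paper (PSD Gram matrix).
    n = len(samples)
    G = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            G[i, j] = G[j, i] = kernel(samples[i], samples[j])
    return float(np.linalg.eigvalsh(G).min())

# Usage sketch: increase n until min_gram_eigenvalue(images[:n], ti_best_poly_kernel)
# first drops below zero; `images` is any array of shape (N, m1, m2).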

2.3 Comparison with the Average-Fit Kernel and Data Augmentation

2.3.1 Average-Fit Kernel

In [12], the following “average-fit" kernel is considered

K𝒢trans,avg,linear(Xi,Xj)\displaystyle K_{\mathcal{G}_{trans},avg,linear}(X_{i},X_{j}) :=1|𝒢|T𝒢Klinear(TXi,Xj),\displaystyle:=\frac{1}{|{\cal{G}}|}\sum_{T\in{\cal{G}}}K_{linear}(TX_{i},X_{j}), (4)

which seeks the “average" fit over all transformations. We denote it by KTI,avg,linear(Xi,Xj)K_{TI,avg,linear}(X_{i},X_{j}) for short. It is trivially invariant with respect to the transformations in 𝒢\mathcal{G} and positive definite. However, it is not really a desirable choice for translations when the base kernel is the linear kernel. Note that 1|𝒢trans|T𝒢transTXi=α(1,1,,1)T\frac{1}{|{\cal{G}}_{trans}|}\sum_{T\in{\cal{G}}_{trans}}TX_{i}=\alpha(1,1,\ldots,1)^{T}, where α=1mp[m1],q[m2]Xi(p,q)\alpha=\frac{1}{m}\sum_{p\in[m_{1}],q\in[m_{2}]}X^{(p,q)}_{i} is the average brightness level of XiX_{i}. Therefore KTI,avg,linear(Xi,Xj)=K_{TI,avg,linear}(X_{i},X_{j})= (Avg brightness level of XiX_{i}) ×\times (Avg brightness level of XjX_{j}). The kernel depends solely on the average brightness levels of the samples, basically blurring out all details in the samples. In the case of rotational invariance, it depends only on the average brightness along each concentric circumference. As expected, it produces very poor results, as seen in the experimental results in Section 4.

2.3.2 Data Augmentation

Data augmentation is a popular approach to learn how to recognize translated images. It augments the dataset by creating several translated copies of the existing samples. (A special case is the virtual support vectors method which augments the data with transformations of the support vectors and retrains [26].) SVM with kernel KbaseK_{base} applied to the augmented data is equivalent to employing the average-fit kernel as (4). Consider the case where the augmented data consists of all translates of each image. The resulting dual problem for SVM margin maximization [5] is:

maxλ\displaystyle\max_{\lambda}\ 12i,j,T1,T2λi,T1λj,T2yiyjKbase(T1Xi,T2Xj)+i,Tλi,T\displaystyle-\frac{1}{2}\sum_{i,j,T_{1},T_{2}}\lambda_{i,T_{1}}\lambda_{j,T_{2}}y_{i}y_{j}K_{base}(T_{1}X_{i},T_{2}X_{j})+\sum_{i,T}\lambda_{i,T}
s.t. λi,T0,i[n],T𝒢trans;i,Tλi,Tyi=0.\displaystyle\ \lambda_{i,T}\geq 0,\ \forall i\in[n],\forall T\in\mathcal{G}_{trans};\ \ \sum_{i,T}\lambda_{i,T}y_{i}=0.

The corresponding classifier is sign(i,Tλi,TyiKbase(TXi,X)+b)\text{sign}(\sum_{i,T}\lambda_{i,T}^{*}y_{i}K_{base}(TX_{i},X)+b^{*}), where λ\lambda^{*} is the optimal dual variable and b=yji,Tλi,TyiKbase(TXi,TXj)b^{*}=y_{j}-\sum_{i,T}\lambda_{i,T}^{*}y_{i}K_{base}(TX_{i},T^{\prime}X_{j}), for any jj and TT^{\prime} satisfying λj,T>0\lambda_{j,T^{\prime}}^{*}>0. When no data augmentation is implemented, i.e., |𝒢trans|=1|\mathcal{G}_{trans}|=1, we use λi\lambda_{i} as shorthand for λi,1\lambda_{i,1}. As shown in Theorem 4.1 [18], this is simply the dual problem for the SVM with KTI,avg,baseK_{TI,avg,base}, and iλiKTI,avg,base(Xi,Xj)=i,T𝒢λi,TKbase(TXi,Xj)j\sum_{i}\lambda_{i}K_{TI,avg,base}(X_{i},X_{j})=\sum_{i,T\in\mathcal{G}}\lambda_{i,T}K_{base}(TX_{i},X_{j})\forall j. Hence data augmentation is mathematically equivalent to a kernel with average similarity over all transformations. This yields a poor classifier since it only leverages the average brightness level of an image.

A simple example illustrates superiority of KTI,best,linearK_{TI,best,linear} over KTI,avg,linearK_{TI,avg,linear} or data augmentation.

Example 2.

Consider a special case of the translation group with m1=2m_{1}=2 and m2=1m_{2}=1. Consider a training set with just two samples X1=(1,2)X_{1}=(1,2) and X2=(5,2)X_{2}=(5,2), shown in red in Figure 1. Note that a translation in the image space is a right shift of vector elements. (So a translation of X1=(1,2)X_{1}=(1,2) yields the augmented datum X3=(2,1)X_{3}=(2,1) in this two-dimensional context because of wraparound.) Two new samples X3=(2,1)X_{3}=(2,1) and X4=(2,5)X_{4}=(2,5) are therefore generated through data augmentation, shown in green. The decision boundary of the base kernel with augmented data (equivalent to the average-fit kernel KTI,avgK_{TI,avg}) is shown by the blue solid line in Figure 1. Note that the linear decision boundary X(1)+X(2)=5X^{(1)}+X^{(2)}=5 depends solely on the brightness level X(1)+X(2)X^{(1)}+X^{(2)}.

However, the decision boundary of the best-fit kernel KTI,best,linear(Xi,Xj)=supT𝒢trans12(TXi)TXjK_{TI,best,linear}(X_{i},X_{j})=\sup_{T\in{\cal{G}}_{trans}}\frac{1}{2}(TX_{i})^{T}X_{j} is piecewise linear due to the “sup” operation. Each piece focuses only on the half of the samples that are on the same side of the symmetry axis (black dashed line), leading to the red piecewise linear separatrix with a larger margin. (The red margin is 102\frac{\sqrt{10}}{2}, which is larger than the blue margin 2\sqrt{2}.) The best-fit kernels thereby maximize the margin over the larger class of piecewise linear separatrices by exploiting the invariances. Even when data augmentation is used, it only produces linear separatrices and thus still has smaller margins. Benefits will be even more pronounced in higher dimensions. For other kernels (e.g., polynomial kernels), the shape of the decision boundary will be altered correspondingly (e.g., piecewise polynomial), but the TI best-fit kernel will still provide a larger margin.
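Example 2 can be checked numerically. The sketch below (our illustration using scikit-learn's precomputed-kernel interface, not part of the paper's experimental pipeline) builds the best-fit and average-fit Gram matrices for the two training samples, fits an essentially hard-margin SVM with each, and recovers the geometric margin 1/||w|| from the dual coefficients; up to the common 1/sqrt(m) scaling of the normalized linear kernel (here 1/sqrt(2)), the two margins reproduce the comparison between sqrt(10)/2 and sqrt(2) quoted above.

import numpy as np
from sklearn.svm import SVC

def shifts(x):
    # All cyclic shifts of a 1-D vector: the translation group for m1 = 2, m2 = 1.
    return [np.roll(x, k) for k in range(len(x))]

def k_best(x, z):
    # Best-fit TI linear kernel, Eq. (2), with m = len(x).
    return max(float(np.dot(t, z)) for t in shifts(x)) / len(x)

def k_avg(x, z):
    # Average-fit TI linear kernel, Eq. (4).
    return sum(float(np.dot(t, z)) for t in shifts(x)) / (len(x) * len(shifts(x)))

def rkhs_margin(kernel, X, y):
    # Geometric margin 1/||w|| of a (nearly) hard-margin SVM with a precomputed kernel.
    G = np.array([[kernel(a, b) for b in X] for a in X])
    clf = SVC(kernel="precomputed", C=1e6).fit(G, y)
    sv = clf.support_
    w_sq = (clf.dual_coef_ @ G[np.ix_(sv, sv)] @ clf.dual_coef_.T).item()
    return 1.0 / np.sqrt(w_sq)

X = [np.array([1.0, 2.0]), np.array([5.0, 2.0])]  # X1 and X2 of Example 2
y = [1, -1]
print(rkhs_margin(k_best, X, y))  # sqrt(5)/2 ~ 1.118, i.e., (sqrt(10)/2)/sqrt(2)
print(rkhs_margin(k_avg, X, y))   # 1.0, i.e., sqrt(2)/sqrt(2)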

Figure 1: (a) The kernel KTI,best,linearK_{TI,best,linear} produces the piecewise linear separatrix shown in red. It yields a larger margin than the blue linear separatrix, i.e., decision boundary, that data augmentation and KTI,avgK_{TI,avg} yield; (b) Translated MNIST (T-MNIST) and (c) Rotated MNIST (R-MNIST): Transformed MNIST with random translation/rotation and Gaussian noise (μ=0,σ=0.1\mu=0,\sigma=0.1).

2.4 Rotation-Invariant Kernels

To mathematically treat rotation-invariant kernels, consider images that are circular, of radius rr, with each concentric annulus from radius (p1)rm1\frac{(p-1)r}{m_{1}} to prm1\frac{pr}{m_{1}}, for p[m1]p\in[m_{1}], comprised of m2m_{2} pixels spanning the sectors [2kπ/m2,2(k+1)π/m2)[2k\pi/m_{2},2(k+1)\pi/m_{2}) for k=0,1,,m21k=0,1,\ldots,m_{2}-1. Denote the element in the pp-th annulus and qq-th sector as “pixel" X(p,q)X^{(p,q)}, and define the rotation group 𝒢rotate:={T1,T2,,Tm2}\mathcal{G}_{rotate}:=\{T_{1},T_{2},...,T_{m_{2}}\}, where TqX(p,q)=X(p,(q+q)modm2)T_{q^{\prime}}X^{(p,q)}=X^{(p,(q+q^{\prime})\operatorname{mod}m_{2})}. The rotation-invariant (RI) best-fit kernel is

KRI,best,base(Xi,Xj):=supT𝒢rotateKbase(TXi,Xj).\displaystyle K_{RI,best,base}(X_{i},X_{j}):=\sup_{T\in\mathcal{G}_{rotate}}K_{base}(TX_{i},X_{j}).
Lemma 2.

Under Assumption 1, the rotation-invariant kernel KRI,best,polyK_{RI,best,poly} is positive definite with probability approaching one as m+m\to+\infty, under the same conditions as in Theorem 1.

In Section 4, we report on the large performance gain of KRI,best,polyK_{RI,best,poly}.

3 Incorporating Locality at Multiple Spatial Scales

To better illustrate the property of “locality" and its incorporation into SVMs, consider the simple context of a polynomial kernel and a one-dimensional real-valued pixel sequence.

Let us regard the individual pixel values in the sequence {Xi(1),Xi(2),,Xi(m)}\{X_{i}^{(1)},X_{i}^{(2)},\ldots,X_{i}^{(m)}\} as the primitive features at “Layer 0". Consider now a “local" feature depending only on the nearby pixels {Xi(),Xi(+1),,Xi(+k1)}\{X_{i}^{(\ell)},X_{i}^{(\ell+1)},\ldots,X_{i}^{(\ell+k_{1})}\} that can be modeled by a polynomial of degree d1d_{1}. We refer to k1k_{1} as the locality parameter.

Such a local feature is a linear combination of monomials of the form j=min(+k1,m)(Xi(j))cj\prod_{j=\ell}^{\min(\ell+k_{1},m)}(X_{i}^{(j)})^{c_{j}} with j=min(+k1,m)cjd1\sum_{j=\ell}^{\min(\ell+k_{1},m)}c_{j}\leq d_{1} where each integer cj0c_{j}\geq 0. This gives rise to a kernel

KL,ij(1):=[=1mk1(p=+k1Xi(p)Xj(p)+1)d1+1]d2.K_{L,ij}^{(1)}:=[\sum_{\ell=1}^{m-k_{1}}(\sum_{p=\ell}^{\ell+k_{1}}X_{i}^{(p)}X_{j}^{(p)}+1)^{d_{1}}+1]^{d_{2}}. (5)

We regard “Layer 1" as comprised of such local features of locality parameter k1k_{1}, at most k1k_{1} apart.

“Layer 2" allows the composition of local features at Layer 1 that are at most k2k_{2} apart:

KL,ij(2)\displaystyle K_{L,ij}^{(2)} :={g=1mk1k2[=gg+k2(p=+k1Xi(p)Xj(p)+1)d1+1]d2+1}d3.\displaystyle:=\{\sum_{g=1}^{m-k_{1}-k_{2}}[\sum_{\ell=g}^{g+k_{2}}(\sum_{p=\ell}^{\ell+k_{1}}X_{i}^{(p)}X_{j}^{(p)}+1)^{d_{1}}+1]^{d_{2}}+1\}^{d_{3}}. (6)

This can be recursively applied to define deeper kernels with locality at several coarser spatial scales.

The above procedure extends naturally to two-dimensional images {Xi(p,q):p[m1],q[m2]}\{X_{i}^{(p,q)}:p\in[m_{1}],q\in[m_{2}]\}. Then the kernel at Layer 1 is simply (s=1m2k1=1m1k1(q=ss+k1p=+k1Xi(p,q)Xj(p,q)+1)d1+1)d2(\sum_{s=1}^{m_{2}-k_{1}}\sum_{\ell=1}^{m_{1}-k_{1}}(\sum_{q=s}^{s+k_{1}}\sum_{p=\ell}^{\ell+k_{1}}X_{i}^{(p,q)}X_{j}^{(p,q)}+1)^{d_{1}}+1)^{d_{2}}. The resulting kernels are always positive definite:
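A minimal sketch of Eqs. (5) and (6) for one-dimensional sequences follows (the locality parameters and degrees are illustrative, and this is not the padded, strided two-dimensional version used in Section 4.3).

import numpy as np

def local_kernel_layer1(xi, xj, k1=2, d1=3, d2=2):
    # Layer-1 locality kernel K_L^{(1)} of Eq. (5) for 1-D sequences xi, xj.
    m = len(xi)
    windows = [float(np.dot(xi[l:l + k1 + 1], xj[l:l + k1 + 1])) for l in range(m - k1)]
    return (sum((w + 1.0) ** d1 for w in windows) + 1.0) ** d2

def local_kernel_layer2(xi, xj, k1=2, k2=2, d1=3, d2=2, d3=2):
    # Layer-2 locality kernel K_L^{(2)} of Eq. (6): composes layer-1 local
    # features that are at most k2 apart, at a coarser spatial scale.
    m = len(xi)
    windows = [float(np.dot(xi[l:l + k1 + 1], xj[l:l + k1 + 1])) for l in range(m - k1)]
    inner = [(w + 1.0) ** d1 for w in windows]
    groups = [(sum(inner[g:g + k2 + 1]) + 1.0) ** d2 for g in range(m - k1 - k2)]
    return (sum(groups) + 1.0) ** d3

rng = np.random.default_rng(0)
a, b = rng.normal(size=10), rng.normal(size=10)
print(local_kernel_layer1(a, b), local_kernel_layer2(a, b))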

Lemma 3.

KLK_{L} is a positive definite kernel.

Proof. Note that if K1K_{1} and K2K_{2} are positive definite kernels, then the following kernels KK obtained by Schur products [28], addition, or adding a positive constant elementwise, are still positive definite kernels: (i) Kij=αK1,ij+βK2,ijK_{ij}=\alpha K_{1,ij}+\beta K_{2,ij}, (ii) Kij=(K1,ij)1(K2,ij)2K_{ij}=(K_{1,ij})^{\ell_{1}}(K_{2,ij})^{\ell_{2}}, (iii) Kij=K1,ij+γK_{ij}=K_{1,ij}+\gamma, α,β0,1,2,γ0\forall\alpha,\beta\geq 0,\ell_{1},\ell_{2}\in\mathbb{N},\gamma\geq 0. The kernel KLK_{L} can be obtained by repeatedly employing the above operations with α=β=γ=1\alpha=\beta=\gamma=1, starting with a base linear kernel which is positive definite with high probability under the conditions of Theorem 2. ∎

One difference from CNNs is that, for the same input layer, one cannot have multiple output channels. If we design multiple channels with different degrees, then the channel with a larger degree will automatically subsume all terms generated by the channel with a smaller degree. Therefore, it is equivalent to having only one output channel with the largest degree. However, if the images have multiple channels to start with (as in R, G, and B, for example), then they can be handled separately. But after they are combined at a layer, there can only be one channel at subsequent higher layers.

Combining Locality at Multiple Spatial Scales with Transformational Invariance.

To combine both locality at multiple spatial scales and transformational invariance, a kernel with locality at multiple spatial scales can be introduced as a base kernel into transformation-invariant kernels.
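For instance, one may take a layer-1 locality kernel as the base kernel inside the supremum of Eq. (1); a sketch for one-dimensional sequences under the cyclic translation group (with illustrative parameters) is given below.

import numpy as np

def locality_base(xi, xj, k1=2, d1=3, d2=2):
    # Layer-1 locality kernel of Eq. (5), used here as the base kernel.
    m = len(xi)
    windows = [float(np.dot(xi[l:l + k1 + 1], xj[l:l + k1 + 1])) for l in range(m - k1)]
    return (sum((w + 1.0) ** d1 for w in windows) + 1.0) ** d2

def ti_best_locality_kernel(xi, xj, **kw):
    # Best-fit translation-invariant kernel (Eq. 1) with the locality kernel as K_base.
    return max(locality_base(np.roll(xi, r), xj, **kw) for r in range(len(xi)))

rng = np.random.default_rng(1)
a, b = rng.normal(size=10), rng.normal(size=10)
print(ti_best_locality_kernel(a, b))
print(ti_best_locality_kernel(np.roll(a, 4), b))  # unchanged: translation invariance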

Complexity Analysis and Memory Trade-off.

One may trade off between the memory requirement and computation time when it comes to the depth of the architecture. Supported by adequate memory space, one can store all kernel values from every layer, with both computation time and memory space increasing linearly with depth. In contrast, when limited by memory space, one can store only the kernel values from the final layer. In that case, although the memory requirement does not increase with depth, computation time grows exponentially with depth.

The time complexity of computing the polynomial kernel is between O(n2m)O(n^{2}m) and O(n3m)O(n^{3}m) based on LIBSVM [2], while space complexity is O(n2)O(n^{2}). With sufficient memory of order O(n2m)O(n^{2}m), the computations of kernel values can be parallelized so that the time complexity of the locality kernel is considerably reduced to between O(n2kd)O(n^{2}kd) and O(n3kd)O(n^{3}kd), where kk and dd are the locality parameter and the depth, respectively, with kdmkd\ll m. Note that since our focus is on small sample sizes, the O(n2)O(n^{2}) complexity is acceptable.

4 Experimental Evaluation

4.1 Datasets

We evaluate the performance of the methods developed on four datasets:

  1. The Original MNIST Dataset [17].

  2. The EMNIST Letters Dataset [4].

  3. The Translated MNIST Dataset: Since most generally available datasets appear to have already been centered or otherwise preprocessed, we “uncenter" them to better verify the accuracy improvement of TI kernels. We place the objects in a larger (64×64×1) canvas, and then randomly translate them so that they are not necessarily centered but still maintain their integrity. In addition, we add Gaussian noise (μ=0,σ=0.1\mu=0,\sigma=0.1) to avoid being able to accurately center the image by calculating the center-of-mass. We call the resulting dataset the “Translated dataset". Figure 1(b) shows some samples from different classes of the Translated MNIST dataset.

  4. The Rotated MNIST Dataset: We place the original image in the middle of a larger canvas and rotate it, with the blank area after rotation filled with Gaussian noise. RI kernels are not designed to and cannot distinguish between rotationally equivalent elements (e.g., 6 versus 9), and so we skip them. Figure 1(c) displays samples from different classes of the Rotated MNIST dataset. (A sketch of how the transformed datasets in items 3 and 4 might be generated is given after this list.)
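The following sketch shows how such transformed samples might be generated; load_mnist_images is a hypothetical loader returning 28×28 arrays with values in [0, 1], and the exact noise and placement choices are illustrative rather than the precise recipe used for the reported datasets.

import numpy as np
from scipy.ndimage import rotate

def translate_into_canvas(img, canvas=64, rng=np.random.default_rng(0), sigma=0.1):
    # Place a digit at a random position in a canvas x canvas image and add Gaussian noise.
    h, w = img.shape
    out = np.zeros((canvas, canvas))
    top = rng.integers(0, canvas - h + 1)
    left = rng.integers(0, canvas - w + 1)
    out[top:top + h, left:left + w] = img
    return np.clip(out + rng.normal(0.0, sigma, out.shape), 0.0, 1.0)

def rotate_in_canvas(img, canvas=64, rng=np.random.default_rng(0), sigma=0.1):
    # Center the digit in the canvas, rotate it by a random angle, and add Gaussian noise.
    h, w = img.shape
    out = np.zeros((canvas, canvas))
    top, left = (canvas - h) // 2, (canvas - w) // 2
    out[top:top + h, left:left + w] = img
    out = rotate(out, angle=float(rng.uniform(0.0, 360.0)), reshape=False, order=1)
    return np.clip(out + rng.normal(0.0, sigma, out.shape), 0.0, 1.0)

# Usage, assuming load_mnist_images() yields an array of shape (N, 28, 28) in [0, 1]:
# X = load_mnist_images()
# translated = np.stack([translate_into_canvas(x) for x in X[:500]])
# rotated = np.stack([rotate_in_canvas(x) for x in X[:500]])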

4.2 Experimental Results and Observations

Table 2: Original MNIST Dataset and EMNIST Letters Dataset (100, 200, 500 training samples): Test accuracy of newly proposed methods compared with the original SVM, the tangent distance (TD) nearest neighbors (two-sided), the RI-SVM based on Average Fit, and the best CNN. Based on the same training set, our fine-tuned ResNet achieves similar performance as in [10].
Method                   Original MNIST (Acc/%)      EMNIST Letters (Acc/%)
                         100     200     500         100     200     500
L-TI-RI-SVM              81.55   89.23   92.58       44.56   55.18   66.42
TI-RI-SVM                75.10   86.47   93.11       43.16   52.40   67.42
L-TI-SVM                 78.86   87.02   91.01       42.51   52.81   64.66
L-RI-SVM                 77.96   83.96   89.65       38.39   47.29   59.76
TI-SVM                   69.34   82.34   91.00       39.94   48.12   63.52
RI-SVM                   73.82   83.60   90.19       38.03   45.02   59.04
L-SVM                    75.27   82.11   88.21       37.01   45.08   58.05

SVM                      68.16   78.67   87.14       36.65   42.74   56.38
TD (two-sided) [30]      73.04   81.68   88.15       37.99   45.31   57.63
RI-SVM (Average-Fit)     68.05   78.81   87.21       36.82   42.41   56.22
Best CNN [10] / ResNet   68.33   83.20   91.33       33.82   53.17   57.06

Table 2 and Figure 2 provide the test accuracy of all methods on the Original and Transformed MNIST Datasets, respectively, while Table 2 shows the test accuracy for the EMNIST Letters Dataset [4]. The letters L, TI, and RI represent Locality at multiple spatial scales, TI kernels, and RI kernels, respectively. A combination such as L-TI represents an SVM with both Locality and Translational Invariance. Note that the intended application of our proposed methods is to learn when data is scarce and there is no large database of similar data, which precludes the possibility of pre-training. We provide code at https://github.com/tliu1997/TI-SVM.

For the Original MNIST dataset with 100/200/500 training samples (Table 2), after introducing locality and transformational invariance, the classification accuracy is improved from 68.33%/83.20%/91.33% reported by the best CNNs optimized over architectures and dimensions [10] to 81.55%/89.23%/93.11%, respectively. These improvements suggest that the objects in the original dataset are not perfectly centered and deskewed. Larger improvements can be observed on the EMNIST Letters dataset [4] compared with the original SVM, two-sided tangent distance (TD) nearest neighbors [30], RI kernel based on Average-Fit, and ResNet. (Since we find that the test accuracy of TD-based nearest neighbors [30] is better than that of TD-based SVMs [13], we only report results of TD-based nearest neighbors as one of the baselines.) All results are multi-class classification accuracies.

In Figure 2, we present the obtained test accuracy as a function of the number of training samples, for the two transformed datasets. Each experiment is repeated 5 times for 100/300/500/700/1000 training samples; the markers denote the mean and the bars around each marker denote the standard deviation. (L-)TI-SVM and (L-)RI-SVM outperform ResNet in many cases when there is no data augmentation, since they embed useful domain knowledge into the classifiers, especially in the small training-sample regime. However, as the number of training samples increases, the benefits brought by domain knowledge gradually decrease, as shown in Figure 2. Additionally, the test accuracy of the newly proposed methods has a smaller variance than ResNet’s in general.

From the experimental results, we see that all SVMs with the newly defined kernels improve upon the test accuracy of the original SVM method, on both the original and the transformed datasets. They also greatly outperform the best CNNs in the small training sample regime of interest. For transformed datasets, the improvements are more significant. Note that the performance improvement of the proposed methods comes at the cost of longer computation time. When computation time is critical, we may simply use the single-property methods, i.e., L-SVM, TI-SVM, and RI-SVM, which achieve relatively good performance at a small additional computational cost.

Figure 2: Test accuracy vs. number of training samples for the Transformed MNIST datasets: (a) Translated MNIST; (b) Rotated MNIST. L-TI-SVM yields considerable improvement at small sizes in the case of translated samples, while, similarly, L-RI-SVM does so in the case of rotated samples.

4.3 Details of Experimental Evaluation

With consideration of computational speed and memory space, we utilize a two-layer structure (5) as well as a k112\frac{k_{1}-1}{2}-zero padding and a stride of 1 to implement locality. In order to compare the test accuracy of L-SVM, TI-SVM, RI-SVM and further combine them, we select a polynomial kernel with a fixed degree (8 in our experiments) to realize the proposed methods. Note that degree 8 is not necessarily the optimal degree; one can tune the specific degree for different datasets.

We compare our results with [10], which adopts a tree-building technique to examine all possible combinations of layers and corresponding dimensionalities to find the optimal CNN architecture. For the EMNIST Letters and Transformed MNIST datasets, we select ResNet [14, 15], a classic CNN architecture, as the DNN benchmark. Moreover, for fairness, we do not use data augmentation for any model, and all models are trained from scratch.

Note that all experimental results are based on LIBSVM [2] and are carried out on an Intel Xeon E5-2697A V4 Linux server with a maximum clock rate of 2.6 GHz and a total memory of 512 GB.

5 Concluding Remarks

In this paper, we have developed transformation-invariant kernels that capture domain knowledge of the invariances of the application domain. They can additionally incorporate composition and locality at multiple spatial scales. The resulting kernels appear to provide significantly superior classification performance in the small sample size regime that is of interest in this paper. Experiments demonstrate that, for the same polynomial kernel, incorporating locality and transformational invariance improves accuracy, especially in situations where data is scarce.

Acknowledgement

This material is based upon work by Texas A&M University that is partially supported by the US Army Contracting Command under W911NF-22-1-0151, US Office of Naval Research under N00014-21-1-2385, 4/21-22 DARES: Army Research Office W911NF-21-20064, US National Science Foundation under CMMI-2038625. The views expressed herein and conclusions contained in this document are those of the authors and should not be interpreted as representing the views or official policies, either expressed or implied, of the U.S. Army Contracting Command, ONR, ARO, NSF, or the United States Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

References

  • [1] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152, 1992.
  • [2] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1–27, 2011.
  • [3] J. Chen and J. Ye. Training svm with indefinite kernels. In Proceedings of the 25th international conference on Machine learning, pages 136–143, 2008.
  • [4] G. Cohen, S. Afshar, J. Tapson, and A. Van Schaik. EMNIST: Extending MNIST to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 2921–2926. IEEE, 2017.
  • [5] C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.
  • [6] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. Autoaugment: Learning augmentation policies from data. Conference on Computer Vision and Pattern Recognition, 2019.
  • [7] I. Daubechies, R. DeVore, S. Foucart, B. Hanin, and G. Petrova. Nonlinear approximation and (deep) relu networks. Constructive Approximation, pages 1–46, 2021.
  • [8] D. DeCoste and M. C. Burl. Distortion-invariant recognition via jittered queries. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No. PR00662), volume 1, pages 732–737. IEEE, 2000.
  • [9] D. DeCoste and B. Schölkopf. Training invariant support vector machines. Machine learning, 46(1):161–190, 2002.
  • [10] R. N. D’Souza, P.-Y. Huang, and F.-C. Yeh. Structural analysis and optimization of convolutional neural networks with a small sample size. Scientific Reports, 10(1):1–13, Nature Publishing Group, 2020.
  • [11] K. Ghiasi-Shirazi, R. Safabakhsh, and M. Shamsi. Learning translation invariant kernels for classification. Journal of Machine Learning Research, 11(4), 2010.
  • [12] B. Haasdonk and H. Burkhardt. Invariant kernel functions for pattern analysis and machine learning. Machine learning, 68(1):35–61, 2007.
  • [13] B. Haasdonk and D. Keysers. Tangent distance kernels for support vector machines. In Object recognition supported by user interaction for service robots, volume 2, pages 864–868. IEEE, 2002.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016.
  • [16] F. Lauer and G. Bloch. Incorporating prior knowledge in support vector machines for classification: A review. Neurocomputing, 71(7-9):1578–1594, 2008.
  • [17] Y. LeCun and C. Cortes. MNIST handwritten digit database. 2010.
  • [18] Z. Li, R. Wang, D. Yu, S. S. Du, W. Hu, R. Salakhutdinov, and S. Arora. Enhanced convolutional neural tangent kernels. arXiv preprint arXiv:1911.00809, 2019.
  • [19] D. Lu and Q. Weng. A survey of image classification methods and techniques for improving classification performance. International journal of Remote sensing, 28(5):823–870, 2007.
  • [20] A. Maier. Will we ever solve the shortage of data in medical applications? https://towardsdatascience.com/will-we-ever-solve-the-shortage-of-data-in-medical-applications-70da163e2c2d, 2020. Accessed: 2022-01-22.
  • [21] J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid. Convolutional kernel networks. Advances in neural information processing systems, 27, 2014.
  • [22] S. Mei, T. Misiakiewicz, and A. Montanari. Learning with invariances in random features and kernel models. Conference on Learning Theory, 2021.
  • [23] Y. Mroueh, S. Voinea, and T. A. Poggio. Learning with group invariant features: A kernel perspective. Advances in Neural Information Processing Systems, 28, 2015.
  • [24] M. Ranzato, F. J. Huang, Y.-L. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In 2007 IEEE conference on computer vision and pattern recognition, pages 1–8. IEEE, 2007.
  • [25] B. Schölkopf, P. Simard, A. J. Smola, and V. Vapnik. Prior knowledge in support vector kernels. In Advances in neural information processing systems, pages 640–646, 1998.
  • [26] B. Schölkopf, A. J. Smola, F. Bach, et al. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002.
  • [27] H. Schulz-Mirbach. Constructing invariant features by averaging techniques. In Proceedings of the 12th IAPR International Conference on Pattern Recognition, Vol. 3-Conference C: Signal Processing (Cat. No. 94CH3440-5), volume 2, pages 387–390. IEEE, 1994.
  • [28] J. Schur. Bemerkungen zur theorie der beschränkten bilinearformen mit unendlich vielen veränderlichen. 1911.
  • [29] C. Shorten and T. M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):1–48, 2019.
  • [30] P. Y. Simard, Y. A. LeCun, J. S. Denker, and B. Victorri. Transformation invariance in pattern recognition—tangent distance and tangent propagation. In Neural networks: tricks of the trade, pages 239–274. Springer, 1998.
  • [31] M. Tan and Q. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR, 2019.
  • [32] V. N. Vapnik. The nature of statistical learning theory. springer, 1995.
  • [33] R. S. Varga. Geršgorin and his circles, volume 36. Springer Science & Business Media, 2010.
  • [34] M. J. Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.
  • [35] Y. LeCun. Generalization and network design strategies. Connectionism in Perspective, 19:143–155, 1989.
  • [36] Y. Ying, C. Campbell, and M. Girolami. Analysis of svm with indefinite kernels. Advances in neural information processing systems, 22, 2009.

Appendix A Supporting Definitions and Lemmas

Definition 1.

A random variable X with mean μ=𝔼[X]\mu=\mathbb{E}[X] is sub-exponential if there are non-negative parameters (ν,b)(\nu,b) such that [34]

𝔼[exp(λ(Xμ))]exp(ν2λ22),|λ|<1b.\displaystyle\mathbb{E}[\exp{(\lambda(X-\mu))}]\leq\exp(\frac{\nu^{2}\lambda^{2}}{2}),\ \ \forall|\lambda|<\frac{1}{b}.

We denote this by XSE(ν,b)X\sim SE(\nu,b).

Lemma 4 (Sub-exponential tail bound [34]).

Suppose that XSE(ν,b)X\sim SE(\nu,b). Then,

[|Xμ|t]{2exp(t22ν2)if 0tν2b,2exp(t2b)if t>ν2b.\displaystyle\mathbb{P}[|X-\mu|\geq t]\leq\begin{cases}2\exp(-\frac{t^{2}}{2\nu^{2}})&\text{if }0\leq t\leq\frac{\nu^{2}}{b},\\ 2\exp(-\frac{t}{2b})&\text{if }t>\frac{\nu^{2}}{b}.\end{cases}

Some properties related to sub-exponential random variables are listed below.

Lemma 5 ([34]).
  1. For a standard normal random variable XX, X2X^{2} is SE(2,4)(2,4);

  2. For a random variable XSE(ν,b)X\sim SE(\nu,b), aXSE(aν,ab)aX\sim SE(a\nu,ab);

  3. Consider independent random variables X1,,XnX_{1},\ldots,X_{n}, where XiSE(νi,bi)X_{i}\sim SE(\nu_{i},b_{i}). Let ν=(ν1,,νn)\nu=(\nu_{1},\ldots,\nu_{n}) and b=(b1,,bn)b=(b_{1},\ldots,b_{n}). Then i=1nXiSE(ν2,b)\sum_{i=1}^{n}X_{i}\sim SE(\|\nu\|_{2},\|b\|_{\infty}).

Appendix B Proof of Theorem 1

Theorem 3 (Restatement of Theorem 1).

Let

KTI,best,linear(Xi,Xj)\displaystyle K_{TI,best,linear}(X_{i},X_{j}) :=supT𝒢trans1m(TXi)TXj\displaystyle:=\sup_{T\in{\cal{G}}_{trans}}\frac{1}{m}(TX_{i})^{T}X_{j}

be the best-fit translation invariant kernel with the base kernel chosen as the normalized linear kernel. Under Assumption 1, if nλ12λ2(lnm)1+ϵ2n\leq\frac{\|\lambda\|_{1}}{2\|\lambda\|_{2}(\ln m)^{\frac{1+\epsilon}{2}}}, then KTI,best,linearK_{TI,best,linear} is a positive definite kernel with probability approaching one, as mm\to\infty.

Proof of Theorem 1.

For brevity, we denote KTI,best,linear(Xi,Xj)K_{TI,best,linear}(X_{i},X_{j}) and Klinear(Xi,Xj)K_{linear}(X_{i},X_{j}) by KTI,ijK_{TI,ij} and KijK_{ij}, respectively. From Gershgorin’s circle theorem [33] every eigenvalue of KTI,best,linearK_{TI,best,linear} lies within at least one of the Gershgorin discs 𝒟(KTI,ii,ri):={λ||λKTI,ii|ri}\mathcal{D}(K_{TI,ii},r_{i}):=\{\lambda\in\mathbb{R}\ |\ |\lambda-K_{TI,ii}|\leq r_{i}\}, where ri:=ji|KTI,ij|r_{i}:=\sum_{j\neq i}|K_{TI,ij}|. Hence if KTI,ii>ji|KTI,ij|,iK_{TI,ii}>\sum_{j\neq i}|K_{TI,ij}|,\ \forall i, then KTI,best,linearK_{TI,best,linear} is a positive definite kernel.

Note that the dimension of the vector XiX_{i} is m=m1×m2m=m_{1}\times m_{2} in the case of a two-dimensional array. Under Assumption 1, Xi𝒩(0,Σ),i[n]X_{i}\sim\mathcal{N}(0,\Sigma),\forall i\in[n]. Since Σ\Sigma is symmetric and positive semi-definite, there exists an orthogonal matrix Om×mO\in\mathbb{R}^{m\times m} with Σ=Odiag(λ(1),,λ(m))OT\Sigma=O\cdot\operatorname{diag}\left(\lambda^{(1)},\ldots,\lambda^{(m)}\right)\cdot O^{T}, where λ(1),,λ(m)0\lambda^{(1)},\ldots,\lambda^{(m)}\geq 0 are the eigenvalues of Σ\Sigma. Define Σ12:=Odiag(λ(1),,λ(m))OT\Sigma^{\frac{1}{2}}:=O\cdot\operatorname{diag}\left(\sqrt{\lambda^{(1)}},\ldots,\sqrt{\lambda^{(m)}}\right)\cdot O^{T}, then Xi=Σ12ZiX_{i}=\Sigma^{\frac{1}{2}}Z_{i}, where Zi𝒩(0,Im)Z_{i}\sim\mathcal{N}(0,I_{m}). Define Hi:=OTZi𝒩(0,Im)H_{i}:=O^{T}Z_{i}\sim\mathcal{N}(0,I_{m}), then

Xi2\displaystyle\|X_{i}\|^{2} =Odiag(λ(1),,λ(m))OTZi2\displaystyle=\left\|O\operatorname{diag}\left(\sqrt{\lambda^{(1)}},\ldots,\sqrt{\lambda^{(m)}}\right)O^{T}Z_{i}\right\|^{2}
=diag(λ(1),,λ(m))Hi2=p=1mλ(p)(Hi(p))2,\displaystyle=\left\|\operatorname{diag}\left(\sqrt{\lambda^{(1)}},\ldots,\sqrt{\lambda^{(m)}}\right)H_{i}\right\|^{2}=\sum_{p=1}^{m}\lambda^{(p)}(H_{i}^{(p)})^{2},

and 𝔼[Xi2]=p=1mλ(p)𝔼[(Hi(p))2]=λ1\mathbb{E}[\|X_{i}\|^{2}]=\sum_{p=1}^{m}\lambda^{(p)}\mathbb{E}[(H_{i}^{(p)})^{2}]=\|\lambda\|_{1}. Let λ:=(λ(1),,λ(m))\lambda:=(\lambda^{(1)},\dots,\lambda^{(m)}). Based on Lemma 5, we have (Hi(p))2SE(2,4)(H_{i}^{(p)})^{2}\sim SE(2,4), λ(p)(Hi(p))2SE(2λ(p),4λ(p))\lambda^{(p)}(H_{i}^{(p)})^{2}\sim SE(2\lambda^{(p)},4\lambda^{(p)}), and Xi2SE(2λ2,4λ)\|X_{i}\|^{2}\sim SE(2\|\lambda\|_{2},4\|\lambda\|_{\infty}). According to Lemma 4,

[1mXi2λ1mtm]{exp(t28λ22)if 0tλ22λ,exp(t8λ)if t>λ22λ.\displaystyle\mathbb{P}\left[\frac{1}{m}\|X_{i}\|^{2}\leq\frac{\|\lambda\|_{1}}{m}-\frac{t}{m}\right]\leq\begin{cases}\exp(-\frac{t^{2}}{8\|\lambda\|_{2}^{2}})&\text{if }0\leq t\leq\frac{\|\lambda\|_{2}^{2}}{\|\lambda\|_{\infty}},\\ \exp(-\frac{t}{8\|\lambda\|_{\infty}})&\text{if }t>\frac{\|\lambda\|_{2}^{2}}{\|\lambda\|_{\infty}}.\end{cases}

Let t=λ1/2t=\|\lambda\|_{1}/2, then we have

(Kii12mλ1)\displaystyle\mathbb{P}\left(K_{ii}\leq\frac{1}{2m}\|\lambda\|_{1}\right) max(exp(λ1232λ22),exp(λ116λ))\displaystyle\leq\max\left(\exp\left(-\frac{\|\lambda\|_{1}^{2}}{32\|\lambda\|_{2}^{2}}\right),\exp\left(-\frac{\|\lambda\|_{1}}{16\|\lambda\|_{\infty}}\right)\right)
(a)exp(λ132λ),\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\exp\left(-\frac{\|\lambda\|_{1}}{32\|\lambda\|_{\infty}}\right),

where (a)(a) holds due to Hölder’s inequality. Noting that KTI,ii=maxT𝒢Klinear(TXi,Xi)Kii,iK_{TI,ii}=\max_{T\in\mathcal{G}}K_{linear}(TX_{i},X_{i})\geq K_{ii},\forall i,

(KTI,ii12mλ1)\displaystyle\mathbb{P}(K_{TI,ii}\leq\frac{1}{2m}\|\lambda\|_{1}) (Kii12mλ1)exp(λ132λ).\displaystyle\leq\mathbb{P}(K_{ii}\leq\frac{1}{2m}\|\lambda\|_{1})\leq\exp\left(-\frac{\|\lambda\|_{1}}{32\|\lambda\|_{\infty}}\right). (7)

Now we turn to the off-diagonal terms KTI,ijK_{TI,ij} for iji\neq j. Writing Λ:=diag(λ(1),,λ(m))\Lambda:=\operatorname{diag}(\lambda^{(1)},\ldots,\lambda^{(m)}), one can write

XiTXj=(OΛ12Hi)T(OΛ12Hj)=HiTΛHj=p=1mλ(p)Hi(p)Hj(p).\displaystyle X_{i}^{T}X_{j}=(O\Lambda^{\frac{1}{2}}H_{i})^{T}(O\Lambda^{\frac{1}{2}}H_{j})=H_{i}^{T}\Lambda H_{j}=\sum_{p=1}^{m}\lambda^{(p)}H_{i}^{(p)}H_{j}^{(p)}.

Note that Hi(p)Hj(p)=12(Y+,ij(p)Y,ij(p))(Y+,ij(p)+Y,ij(p))H_{i}^{(p)}H_{j}^{(p)}=\frac{1}{2}(Y_{+,ij}^{(p)}-Y_{-,ij}^{(p)})(Y_{+,ij}^{(p)}+Y_{-,ij}^{(p)}), where Y+,ij(p):=12(Hj(p)+Hi(p))Y_{+,ij}^{(p)}:=\frac{1}{\sqrt{2}}(H_{j}^{(p)}+H_{i}^{(p)}) and Y,ij(p):=12(Hj(p)Hi(p))Y_{-,ij}^{(p)}:=\frac{1}{\sqrt{2}}(H_{j}^{(p)}-H_{i}^{(p)}) are independent N(0,1)N(0,1) random variables. Hence (Y+,ij(p))2(Y_{+,ij}^{(p)})^{2} and (Y,ij(p))2(Y_{-,ij}^{(p)})^{2} are chi-squared random variables, and their moment generating functions are 𝔼[er(Y+,ij(p))2]=𝔼[er(Y,ij(p))2]=112r\mathbb{E}[e^{r(Y_{+,ij}^{(p)})^{2}}]=\mathbb{E}[e^{r(Y_{-,ij}^{(p)})^{2}}]=\frac{1}{\sqrt{1-2r}} for r<12r<\frac{1}{2}. Hence for any |r|1/λ|r|\leq 1/\|\lambda\|_{\infty}, we know

𝔼[erXiXj]\displaystyle\mathbb{E}[e^{rX_{i}^{\top}X_{j}}] =𝔼[exp(rp=1mλ(p)Hi(p)Hj(p)))]=𝔼[exp(r2p=1mλ(p)((Y+,ij(p))2(Y,ij(p))2)))]\displaystyle=\mathbb{E}\left[\exp\left(r\sum_{p=1}^{m}\lambda^{(p)}H_{i}^{(p)}H_{j}^{(p)})\right)\right]=\mathbb{E}\left[\exp\left(\frac{r}{2}\sum_{p=1}^{m}\lambda^{(p)}\left((Y_{+,ij}^{(p)})^{2}-(Y_{-,ij}^{(p)})^{2})\right)\right)\right]
=(b)p=1m𝔼[exp(rλ(p)2(Y+,ij(p))2)]𝔼[exp(rλ(p)2(Y,ij(p))2)]\displaystyle\stackrel{{\scriptstyle(b)}}{{=}}\prod_{p=1}^{m}\mathbb{E}\left[\exp(\frac{r\lambda^{(p)}}{2}(Y_{+,ij}^{(p)})^{2})\right]\mathbb{E}\left[\exp(-\frac{r\lambda^{(p)}}{2}(Y_{-,ij}^{(p)})^{2})\right]
=p=1m11(λ(p))2r2,\displaystyle=\prod_{p=1}^{m}\frac{1}{\sqrt{1-(\lambda^{(p)})^{2}r^{2}}},

where (b)(b) is true since the random variables {Y+,ij(p),Y,ij(p)}p[m]\{Y_{+,ij}^{(p)},Y_{-,ij}^{(p)}\}_{p\in[m]} are mutually independent. It can be verified that 11xex\frac{1}{\sqrt{1-x}}\leq e^{x} for 0x1/20\leq x\leq 1/2. We can then upper bound the moment generating function 𝔼[erXiXj]\mathbb{E}[e^{rX_{i}^{\top}X_{j}}]: for any |r|1/λ|r|\leq 1/\|\lambda\|_{\infty},

𝔼[erXiXj]=p=1m11(λ(p))2r2p=1mexp((λ(p))2r2)=exp(2λ222r2).\displaystyle\mathbb{E}[e^{rX_{i}^{\top}X_{j}}]=\prod_{p=1}^{m}\frac{1}{\sqrt{1-(\lambda^{(p)})^{2}r^{2}}}\leq\prod_{p=1}^{m}\exp\left((\lambda^{(p)})^{2}r^{2}\right)=\exp\left(\frac{2\|\lambda\|_{2}^{2}}{2}r^{2}\right).

This implies XiXjSE(2λ2,λ)X_{i}^{\top}X_{j}\sim SE(\sqrt{2}\|\lambda\|_{2},\|\lambda\|_{\infty}). Since 𝔼[XiXj]=0\mathbb{E}[X_{i}^{\top}X_{j}]=0, by Lemma 4, we know

[XiXjt]{exp(t24λ22)if 0t2λ22λ,exp(t2λ)if t>2λ22λ.\displaystyle\mathbb{P}\left[X_{i}^{\top}X_{j}\geq t\right]\leq\begin{cases}\exp(-\frac{t^{2}}{4\|\lambda\|_{2}^{2}})&\text{if }0\leq t\leq\frac{2\|\lambda\|_{2}^{2}}{\|\lambda\|_{\infty}},\\ \exp(-\frac{t}{2\|\lambda\|_{\infty}})&\text{if }t>\frac{2\|\lambda\|_{2}^{2}}{\|\lambda\|_{\infty}}.\end{cases}

Let t=λ2(lnm)(1+ϵ)/2t=\|\lambda\|_{2}(\ln m)^{(1+\epsilon)/2}. Under the assumption that λ2λ(lnm)(1+ϵ)/22\frac{\|\lambda\|_{2}}{\|\lambda\|_{\infty}}\geq\frac{(\ln m)^{(1+\epsilon)/2}}{2}, we have

[Kijλ2(lnm)1+ϵ2m]exp((lnm)1+ϵ4).\displaystyle\mathbb{P}\left[K_{ij}\geq\frac{\|\lambda\|_{2}(\ln m)^{\frac{1+\epsilon}{2}}}{m}\right]\leq\exp(-\frac{(\ln m)^{1+\epsilon}}{4}). (8)

Note that |KTI,ij|=|supTKlinear(TXi,Xj)|supT|Klinear(TXi,Xj)||K_{TI,ij}|=|\sup_{T}K_{linear}(TX_{i},X_{j})|\leq\sup_{T}|K_{linear}(TX_{i},X_{j})|, then

(|KTI,ij|λ2(lnm)1+ϵ2m)\displaystyle\mathbb{P}(|K_{TI,ij}|\geq\frac{\|\lambda\|_{2}(\ln m)^{\frac{1+\epsilon}{2}}}{m}) (supT|Klinear(TXi,Xj)|λ2(lnm)1+ϵ2m)\displaystyle\leq\mathbb{P}(\sup_{T}|K_{linear}(TX_{i},X_{j})|\geq\frac{\|\lambda\|_{2}(\ln m)^{\frac{1+\epsilon}{2}}}{m})
(c)m(|Kij|λ2(lnm)1+ϵ2m)(d)2m(Kijλ2(lnm)1+ϵ2m)\displaystyle\stackrel{{\scriptstyle(c)}}{{\leq}}m\mathbb{P}(|K_{ij}|\geq\frac{\|\lambda\|_{2}(\ln m)^{\frac{1+\epsilon}{2}}}{m})\stackrel{{\scriptstyle(d)}}{{\leq}}2m\mathbb{P}(K_{ij}\geq\frac{\|\lambda\|_{2}(\ln m)^{\frac{1+\epsilon}{2}}}{m})
exp((lnm)1+ϵ4+ln2m),\displaystyle\leq\exp(-\frac{(\ln m)^{1+\epsilon}}{4}+\ln 2m), (9)

where (c)(c) holds since the distribution is translation invariant, and (d)(d) holds since KijK_{ij} is a symmetric random variable, i.e., pKij(y)=pKij(y)p_{K_{ij}}(y)=p_{K_{ij}}(-y). Combining (7) and (9), we have

\mathbb{P}(K_{TI,ii}>\sum_{j\neq i}|K_{TI,ij}|,\ \forall i)\geq\mathbb{P}(\min_{i}K_{TI,ii}>\max_{i}\sum_{j\neq i}|K_{TI,ij}|)
\geq\mathbb{P}(\min_{i}K_{TI,ii}>\frac{\|\lambda\|_{1}}{2m}\text{ and }\max_{i}\sum_{j\neq i}|K_{TI,ij}|<\frac{\|\lambda\|_{2}(\ln m)^{\frac{1+\epsilon}{2}}n}{m})
=1-\mathbb{P}(\min_{i}K_{TI,ii}\leq\frac{\|\lambda\|_{1}}{2m}\text{ or }\max_{i}\sum_{j\neq i}|K_{TI,ij}|\geq\frac{\|\lambda\|_{2}(\ln m)^{\frac{1+\epsilon}{2}}n}{m})
\geq 1-\mathbb{P}(\min_{i}K_{TI,ii}\leq\frac{\|\lambda\|_{1}}{2m})-\mathbb{P}(\max_{i}\sum_{j\neq i}|K_{TI,ij}|\geq\frac{\|\lambda\|_{2}(\ln m)^{\frac{1+\epsilon}{2}}n}{m})
=1-\mathbb{P}(K_{TI,ii}\leq\frac{\|\lambda\|_{1}}{2m},\ \exists i)-\mathbb{P}(\sum_{j\neq i}|K_{TI,ij}|\geq\frac{\|\lambda\|_{2}(\ln m)^{\frac{1+\epsilon}{2}}n}{m},\ \exists i)
\stackrel{(e)}{\geq}1-n\mathbb{P}(K_{TI,ii}\leq\frac{\|\lambda\|_{1}}{2m})-n\mathbb{P}(|K_{TI,ij}|\geq\frac{\|\lambda\|_{2}(\ln m)^{\frac{1+\epsilon}{2}}}{m},\ \exists j\neq i)
\geq 1-\exp(-\frac{\|\lambda\|_{1}}{32\|\lambda\|_{\infty}}+\ln n)-\exp(-\frac{1}{4}(\ln m)^{1+\epsilon}+2\ln n+\ln 2m),

where the second inequality uses the theorem’s assumption n\leq\frac{\|\lambda\|_{1}}{2\|\lambda\|_{2}(\ln m)^{\frac{1+\epsilon}{2}}}, which ensures \frac{\|\lambda\|_{1}}{2m}\geq\frac{\|\lambda\|_{2}(\ln m)^{\frac{1+\epsilon}{2}}n}{m}.

Above, (e)(e) holds since the probability distributions of |KTI,ij||K_{TI,ij}| are identical for all jij\neq i. Since \|\lambda\|_{2}^{2}\leq\|\lambda\|_{1}\|\lambda\|_{\infty}, the assumption \frac{\|\lambda\|_{2}}{\|\lambda\|_{\infty}}\geq(\ln m)^{\frac{1+\epsilon}{2}}/2 implies \frac{\|\lambda\|_{1}}{\|\lambda\|_{\infty}}\geq(\frac{\|\lambda\|_{2}}{\|\lambda\|_{\infty}})^{2}\geq\frac{(\ln m)^{1+\epsilon}}{4}. Note that since n=O~(m)n=\tilde{O}(\sqrt{m}),

limm(KTI,ii>ji|KTI,ij|,i)limm11poly(m)=1.\displaystyle\lim_{m\to\infty}\mathbb{P}(K_{TI,ii}>\sum_{j\neq i}|K_{TI,ij}|,\forall i)\geq\lim_{m\to\infty}1-\frac{1}{poly(m)}=1.

Therefore, KTI,best,linearK_{TI,best,linear} is a positive definite kernel with probability approaching one as mm\to\infty. ∎

Appendix C Proof of Theorem 2

Theorem 4 (Restatement of Theorem 2).

For any γ+\gamma\in\mathbb{R}_{+} and dd\in\mathbb{N}, the translation-invariant kernels,

KTI,best,poly(Xi,Xj)\displaystyle K_{TI,best,poly}(X_{i},X_{j}) :=supT𝒢trans(1+γm(TXi)TXj)d\displaystyle:=\sup_{T\in{\cal{G}}_{trans}}(1+\frac{\gamma}{m}(TX_{i})^{T}X_{j})^{d}

are positive definite with probability approaching 1 as m+m\to+\infty, under Assumption 1, λ1\|\lambda\|_{\infty}\leq 1, and nλ12λ2(lnm)1+ϵ2n\leq\frac{\|\lambda\|_{1}}{2\|\lambda\|_{2}(\ln m)^{\frac{1+\epsilon}{2}}}.

Proof.

Define the event Aγ,m,n:={γm(TXi)Xj1,T𝒢,ij}A_{\gamma,m,n}:=\{\frac{\gamma}{m}(TX_{i})^{\top}X_{j}\geq-1,\forall T\in\mathcal{G},\forall i\not=j\}, and the event Bm,n:=B_{m,n}:= {KTI,best,linearK_{TI,best,linear} is positive definite}. Denote K(,)=(1+γKTI,best,linear(,))dK(\cdot,\cdot)=(1+\gamma K_{TI,best,linear}(\cdot,\cdot))^{d}. Conditioned on event Aγ,m,nA_{\gamma,m,n}, KTI,best,poly(Xi,Xj)=K(Xi,Xj)K_{TI,best,poly}(X_{i},X_{j})=K(X_{i},X_{j}) for off-diagonal entries, and KTI,best,poly(Xi,Xi)K(Xi,Xi)K_{TI,best,poly}(X_{i},X_{i})\geq K(X_{i},X_{i}). Since the difference of the two Gram matrices is then diagonal with nonnegative entries, this implies KTI,best,polyKK_{TI,best,poly}\succeq K. Conditioned on event Bm,nB_{m,n}, the three properties in the proof of Lemma 3 indicate that KK is positive definite. Thus KTI,best,polyK_{TI,best,poly} is positive definite conditioned on Aγ,m,nBm,nA_{\gamma,m,n}\cap B_{m,n}.

Now Bm,nB_{m,n} holds w.h.p. by Theorem 1. By the symmetric distribution of KijK_{ij} and (8), we have

(γm(Xi)TXjγλ2(lnm)1+ϵ2m)exp((lnm)1+ϵ4).\displaystyle\mathbb{P}(\frac{\gamma}{m}(X_{i})^{T}X_{j}\leq-\frac{\gamma\|\lambda\|_{2}(\ln m)^{\frac{1+\epsilon}{2}}}{m})\leq\exp(-\frac{(\ln m)^{1+\epsilon}}{4}).

Due to λ2m\|\lambda\|_{2}\leq\sqrt{m} (which follows from λ1\|\lambda\|_{\infty}\leq 1) and union bounds over all pairs iji\neq j and all translations, limm(Aγ,m,n)limm11poly(m)=1.\lim_{m\to\infty}\mathbb{P}(A_{\gamma,m,n})\geq\lim_{m\to\infty}1-\frac{1}{poly(m)}=1. ∎

Appendix D Extensions to other kernels

The property of transformational invariance can be extended to other kernels with the guarantee of positive definiteness, e.g., radial basis function (RBF) kernels with norm-preserving transformations (i.e., TXi=Xi,i[n]\|TX_{i}\|=\|X_{i}\|,\forall i\in[n]). Define the normalized base kernel as

KRBF(Xi,Xj):=exp(XiXj22mσ2)=exp(Xi2+Xj22mσ2)exp(XiTXjmσ2).\displaystyle K_{RBF}(X_{i},X_{j}):=\exp(-\frac{\|X_{i}-X_{j}\|^{2}}{2m\sigma^{2}})=\exp(-\frac{\|X_{i}\|^{2}+\|X_{j}\|^{2}}{2m\sigma^{2}})\exp(\frac{X_{i}^{T}X_{j}}{m\sigma^{2}}).

If transformations are norm-preserving, then

KTI,best,RBF(Xi,Xj)\displaystyle K_{TI,best,RBF}(X_{i},X_{j}) :=exp(Xi2+Xj22mσ2)supT𝒢transexp((TXi)TXjmσ2)\displaystyle:=\exp(-\frac{\|X_{i}\|^{2}+\|X_{j}\|^{2}}{2m\sigma^{2}})\sup_{T\in\mathcal{G}_{trans}}\exp(\frac{(TX_{i})^{T}X_{j}}{m\sigma^{2}})
=exp(Xi2+Xj22mσ2)exp(supT𝒢trans(TXi)TXjmσ2).\displaystyle=\exp(-\frac{\|X_{i}\|^{2}+\|X_{j}\|^{2}}{2m\sigma^{2}})\exp(\sup_{T\in\mathcal{G}_{trans}}\frac{(TX_{i})^{T}X_{j}}{m\sigma^{2}}).

By Theorem 1 and the three properties in the proof of Lemma 3, KTI,best,RBFK_{TI,best,RBF} is still positive definite with high probability, since ex=i=0xii!e^{x}=\sum_{i=0}^{\infty}\frac{x^{i}}{i!}. However, the design of locality at multiple scales does not apply to RBF kernels, since the kernel trick can no longer be exploited. Since we merge the two designs in the experiments, we choose polynomial kernels to illustrate the designs in this paper.
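For completeness, a minimal sketch of K_{TI,best,RBF} for two-dimensional arrays under cyclic (norm-preserving) translations, with an illustrative bandwidth sigma, is given below.

import numpy as np

def ti_best_rbf_kernel(Xi, Xj, sigma=1.0):
    # Best-fit translation-invariant RBF kernel; cyclic translations preserve the norm,
    # so the kernel factors as in the display above.
    m1, m2 = Xi.shape
    m = m1 * m2
    best_inner = max(
        float(np.sum(np.roll(np.roll(Xi, r, axis=0), s, axis=1) * Xj))
        for r in range(m1) for s in range(m2)
    )
    scale = np.exp(-(np.sum(Xi ** 2) + np.sum(Xj ** 2)) / (2 * m * sigma ** 2))
    return scale * np.exp(best_inner / (m * sigma ** 2))

rng = np.random.default_rng(2)
A, B = rng.normal(size=(6, 6)), rng.normal(size=(6, 6))
print(ti_best_rbf_kernel(A, B))
print(ti_best_rbf_kernel(np.roll(A, 2, axis=1), B))  # identical value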