Learning Kernel for Conditional Moment-Matching Discrepancy-based Image Classification

Chuan-Xian Ren, Pengfei Ge, Dao-Qing Dai, Hong Yan C.X. Ren, P.F. Ge, and D.Q. Dai are with the School of Mathematics, Sun Yat-Sen University, Guangzhou, China. C.X. Ren and P.F. Ge contribute equally to this work. Corresponding author: C.X. Ren.H. Yan is with the Department of Electronic Engineering, City University of Hong Kong, Hong Kong.This work is supported in part by the National Natural Science Foundation of China under Grants 61572536, 11631015, and U1611265, in part by the Science and Technology Program of Guangzhou under Grant 201804010248, and in part by the Hong Kong Research Grants Council (Project C1007-15G).

Abstract

Conditional Maximum Mean Discrepancy (CMMD) can capture the discrepancy between conditional distributions by drawing support from nonlinear kernel functions, thus it has been successfully used for pattern classification. However, CMMD does not work well on complex distributions, especially when the kernel function fails to correctly characterize the difference between intra-class similarity and inter-class similarity. In this paper, a new kernel learning method is proposed to improve the discrimination performance of CMMD. It can be operated with deep network features iteratively and thus denoted as KLN for abbreviation. The CMMD loss and an auto-encoder (AE) are used to learn an injective function. By considering the compound kernel, i.e., the injective function with a characteristic kernel, the effectiveness of CMMD for data category description is enhanced. KLN can simultaneously learn a more expressive kernel and label prediction distribution, thus, it can be used to improve the classification performance in both supervised and semi-supervised learning scenarios. In particular, the kernel-based similarities are iteratively learned on the deep network features, and the algorithm can be implemented in an end-to-end manner. Extensive experiments are conducted on four benchmark datasets, including MNIST, SVHN, CIFAR-10 and CIFAR-100. The results indicate that KLN achieves state-of-the-art classification performance.

Index Terms:

Conditional distribution discrepancy, Moment matching network, Supervised learning, Semi-supervised learning, Auto-Encoder, Kernel mappings

I Introduction

Data classification provides us fundamental knowledge for structured information and latent connection in given data. In machine learning, data classification refers to the problem of identifying to which category a new observation belongs, under the assumption that the training set containing observations whose category membership is known or partially known. Accordingly, classification methods are usually categorized into supervised methods and semi-supervised methods. It is worth noting that practical data are usually high dimensional with large variations, which bring additional difficulty in classification. Taking the street view image as an example, data can be captured in crowds, and they have very complex backgrounds [1, 2]. Thus, how to learn discriminative and compact feature, which also has good generalization performance, is a challenging and important problem.

In the past two decades, classification models are usually accompanied by exploration of nonlinear kernel functions, which enable classification models to be operated in a Reproducing Kernel Hilbert Space (RKHS) [3]. Generally, RKHS corresponds to a high-dimensional, implicit but more separable feature space without explicitly computing data coordinates in that space, but rather by simply computing the inner products between all sample pairs in the feature space. It has been validated that increasing the data dimension to very high even infinite by using kernel mappings and then implementing dimension reduction operations can effectively improve the classification accuracy. It also becomes potential mechanism and important experience for data analysis and feature representation [4, 5, 6]. Classification algorithms capable of operating with kernels include Perceptron [7], Support Vector Machine [8], Fisher’s linear discriminant analysis [9, 10], and several others [11, 12].

A probability distribution can also be embedded into RKHS by kernel mappings, and then some linear methods can be used to deal with higher-order statistics [3, 13]. The widely used methods are usually based on MMD and CMMD. MMD captures the discrepancy between marginal distributions, while CMMD captures the discrepancy between conditional distributions [14]. These measures are widely used in independent test [14], non-parametric test [15], image generation [16, 17], and transfer learning tasks [18, 19, 20].

MMD has been successfully applied to measure the difference between two probability distributions via kernel mean embedding. For supervised classification tasks, MMD can be used to estimate the similarity between distributions of positive and negative samples, and maximizing MMD leads to the two classes as apart as possible [21]. It has been proved that MMD of binary-class samples in a certain condition is inverse proportional to the optimal risk of the linear loss function of Parzen window classifier [21]. However, it is difficult to deal with multi-class classification tasks using MMD. In [22], the authors propose conditional generative moment-matching network (CGMMN), which exploits CMMD to build conditional moment matching networks for image classification and generation tasks. Let $X$ be an observation and $Y$ be the response. CMMD measures the discrepancy between conditional distributions $P(Y|X)$ and $P(\widehat{Y}|X)$ , in which $\widehat{Y}$ denotes the predicted value of $Y$ . By minimizing the CMMD measurement, the multi-class classification objectives can be achieved. However, it still has the following limitations. First, CGMMN uses a fixed Gaussian kernel, which limits the expressiveness of CMMD. Second, CGMMN requires the ground-truth label for each input sample, which makes it impossible to deal with semi-supervised classification tasks.

Moreover, both MMD and CMMD cannot work well on complex distributions since it is difficult to find a suitable kernel function. Long et al. [23] propose a kernel selection strategy by maximizing MMD to find a weighted coefficient of the multi-kernel function. Li et al. [17] propose to learn the kernel function by compounding of a characteristic kernel and an injective function. The same problems appear in CMMD. When dealing with the image dataset with complex background, the kernel may not effectively characterize similarity between samples, and then the classification performance based on CMMD degrades significantly. Meanwhile, since CMMD is calculated based on the embedding of conditional distributions, the kernel learning method used in MMD cannot be directly applied in CMMD. Therefore, how to design an effective and distinguishable kernel is the most important task in the CMMD-based data classification algorithms.

In this paper, we propose a kernel learning method, i.e., KLN, to tackle these problems. KLN can simultaneously learn a more representative kernel function and label prediction distribution through CMMD. The CMMD between two similar conditional distributions is minimized to find the representative kernel function, which is compounded by a pre-specified kernel function and a deep network. In order to make the learned kernel function characteristic, an additional AE structure is used to ensure that the transformation function represented by the deep network is approximately injective. KLN is mostly proposed to deal with supervised classification tasks by minimizing the CMMD between $P(Y|X)$ and $P(\widehat{Y}|X)$ , however, it can be extended to deal with semi-supervised classification tasks by using dynamically predicted labels. We evaluate KLN in a wide range of tasks, including supervised classification, visualization of the learned kernel (similarity) and semi-supervised classification. Extensive experiments are conducted on various datasets, and the results show that KLN can obtain very competitive performance.

Our contributions are summarized as follows.

1.

It is observed that the kernel functions used in CMMD cannot effectively represent similarities between sample pairs. An inappropriate kernel function will lead to degradation of the classification performance. A kernel learning method, KLN, is proposed to tackle this problem. KLN approximates the kernel matrix in a feature space mapped by a deep network, rather than the high dimensional input features.
2.

By simultaneously learning the kernel function and the label prediction distribution, an end-to-end training algorithm is proposed to improve the final classification performance. In addition, the algorithm is extended to deal with semi-supervised classification tasks.
3.

KLN is evaluated on several benchmark image datasets. It achieves competitive prediction accuracies in both supervised and semi-supervised classification tasks. The supervised prediction accuracies on the MNIST, SVHN, CIFAR-10 and CIFAR-100 datasets achieve to 99.61%, 98.44%, 94.85% and 77.37%, respectively. This is also the first CMMD-based work to achieve competitive results in semi-supervised classification scenarios.

The rest of this paper is organized as follows. Section II briefly reviews closely related work. In Section III, we first demonstrate the importance of kernel function in CMMD, and then we propose the KLN algorithm to deal with supervised and semi-supervised classification tasks. Experiment results and analysis are presented in Section IV, where KLN is compared with several state-of-the-art methods. Section V concludes the paper and discusses future work.

II Related work

Generative models belong to statistical methodology of the joint probability distribution on both observation variable and response variable. Deep generative models characterize the distribution of observations with a hierarchical architecture and many latent variables. They can be the natural choice for many tasks that require statistical inference and deep convolutional operations, such as image generation [24, 25], style transfer [17], and data classification [26, 27, 28].

Recently, Goodfellow et al. [29] present generative adversarial network (GAN) to deal with the distribution learning problem. GAN adopts a game-theory-based strategy to discriminatively learn the data generator, and formulates the objective function as a min-max optimization problem [27, 30]. However, the adversarial formalism of GAN is hard to convergent to the desired solution, as the gradients can be easily saturated and even disappear. In particular, Li et al. [16] present generative moment matching network (GMMN), which is essentially a generative version of MMD, to simplify the model by sampling from some simple distribution. To learn the network weights, kernel-based MMD is exploited to avoid unnecessary assumptions of the distributions. However, GMMN has some drawbacks in real applications. Besides the quadratic computational complexity, which is principally deduced by MMD, the gradient vanishing phenomenon frequently appears for low-bandwidth kernels. It is probably that some kernels used in practice are unsuitable for capturing very complex distances in high dimensional sample spaces such as natural images [17]. A GMMN network estimates the joint distribution of a set of variables. It seems that conditional distribution matching is more interesting in many other cases such as data classification and generation [31, 32]. Ren et al. [22] present CGMMN to learn a flexible conditional distribution when some input variables are given. CGMMN extends the capability of GMMN to address a wide range of application problems as mentioned above, while keeping the training process simple. The successful applications of CGMMN can be primarily attributed to the flexible kernel embedding of conditional probability distributions, which rely on a generalization of covariance matrix, known as the cross-covariance operator [33]. Song et al. [34] have proved that under some assumptions, the conditional embedding exists, and it can be expressed in terms of cross-covariance operator. Specifically, CGMMN attempts to learn the true category distribution by minimizing the CMMD between two conditional distributions.

Refer to caption — Figure 1: Heat map of matrix H on MNIST dataset. (a) The samples were drawn from different classes. (b) The samples were drawn from the same class. We can see that the off-diagonal elements of H are almost identical to each other, and there is almost no difference between the two matrices. It indicates that the kernel functions used in CMMD criterion fails to distinguish between-class sample similarities and within-class sample similarities.

The KLN method is inspired by CGMMN, but there are two key differences in the training process:

1.

CGMMN uses a pre-specified kernel on the raw input features directly, while KLN introduces the kernel learning process to improve the expressiveness of CMMD.
2.

CGMMN estimates CMMD on each mini-batch, which makes it necessary to have a real label for each sample. However, by simultaneously learning the kernel function and predicting the distribution by CMMD, KLN can use the unlabeled data to learn a more discriminant kernel function and estimate the CMMD between the true conditional distribution and the predicted distribution. This enables it to deal with semi-supervised classification tasks effectively.

III The KLN Algorithm

In this section, the KLN method is introduced in details to learn more expressive kernel functions and improve the classification performance. Roughly speaking, KLN learns discriminative features in an end-to-end manner by integrating CMMD and an AE framework. In particular, the kernel-based similarities are obtained by using the embedding features, rather than the input raw features. Thus, the kernel matrices are not unity in the training process and they make final features more adaptive and discriminative.

III-A Kernel degradation in CMMD

The principal problem that pattern classification concerns is whether conditional distribution equation $P(Y|X)=P(\widehat{Y}|X)$ holds or not. Intuitively, CMMD is suitable for solving the pattern classification task since it can capture the discrepancy between two conditional distributions. Suppose that $\phi$ and $\psi$ denote the nonlinear mapping functions for $X$ and $Y$ , respectively. Then cross-covariance $C_{XY}:\mathcal{G}\rightarrow\mathcal{F}$ is defined as [33]:

C_{XY}=\mathbb{E}_{XY}[\phi(X)\otimes\psi(Y)]-\mu_{X}\otimes\mu_{Y},

where $\otimes$ is the tensor product operator. The embedding of conditional distributions $C_{Y|X}$ can be defined as [34]:

C_{Y|X}=C_{YX}C_{XX}^{-1}.

(1)

Given a dataset $\mathcal{D}_{XY}$ drawn $i.i.d$ from $P(X,Y)$ , $C_{Y|X}$ can be estimated by:

	$\displaystyle\widehat{C}_{Y\|X}$	$\displaystyle=$	$\displaystyle\Psi(\Phi^{\top}\Phi+\lambda\textbf{I})^{-1}\Phi^{\top}$		(2)
		$\displaystyle\triangleq$	$\displaystyle\Psi(\mathcal{K}+\lambda\textbf{I})^{-1}\Phi^{\top},$		(2)

where $\Phi$ denotes the embedding of $X$ , $\Psi$ denotes the embedding of $Y$ , $\mathcal{K}=\Phi^{\top}\Phi$ is the Gram matrix, and $\lambda$ is a positive regularization parameter.

We randomly sample two batches, e.g., $\mathcal{D}_{XY}^{s}=\{(\textbf{x}_{i}^{s},y_{i}^{s})\}_{i=1}^{n}$ and $\mathcal{D}_{X\hat{Y}}^{t}=\{(\textbf{x}_{i}^{t},\widehat{y}_{i}^{t})\}_{i=1}^{n}$ , from $P(Y|X)$ and $P(\widehat{Y}|X)$ , respectively. Let $\textbf{X}_{s}=[\textbf{x}_{1}^{s},\cdots,\textbf{x}_{n}^{s}]$ and $\textbf{X}_{t}=[\textbf{x}_{1}^{t},\cdots,\textbf{x}_{n}^{t}]$ , and the superscripts or subscripts $s$ and $t$ represent the two sample sets, respectively. $\hat{y}_{i}^{t}$ is the predicted label of $x_{i}^{t}$ , and it is derived by some specified prediction method such as softmax. It is worth noting that $\textbf{X}_{s}$ and $\textbf{X}_{t}$ may be non-overlapping, because in some applications, e.g., semi-supervised classification tasks, there are some unlabeled input data. The empirical estimation of CMMD is

$\displaystyle L_{\rm{CMMD}}=$	$\displaystyle\\|\widehat{C}_{Y\|\textbf{X}}^{s}-\widehat{C}_{\widehat{Y}\|\textbf{X}}^{t}\\|_{\mathcal{F\otimes G}}^{2}$	(3)
$\displaystyle=$	$\displaystyle\\|\Psi_{s}(K_{s}+\lambda\textbf{I})^{-1}\Phi_{s}^{\top}-\Psi_{t}(K_{t}+\lambda\textbf{I})^{-1}\Phi_{t}^{\top}\\|_{\mathcal{F\otimes G}}^{2}$
$\displaystyle=$	$\displaystyle\mbox{Tr}(K_{s}\widetilde{K}_{s}^{-1}\mathcal{L}_{s}\widetilde{K}_{s}^{-1})+\mbox{Tr}(K_{t}\widetilde{K}_{t}^{-1}\mathcal{L}_{t}\widetilde{K}_{t}^{-1})$
	$\displaystyle-2\cdot\mbox{Tr}(K_{ts}\widetilde{K}_{s}^{-1}\mathcal{L}_{st}\widetilde{K}_{t}^{-1}),$

where $\Psi_{s}=[\psi(y_{1}^{s}),\cdots,\psi(y_{n}^{s})]$ , $\Phi_{s}=[\phi(\textbf{x}_{1}^{s}),\cdots,\phi(\textbf{x}_{n}^{s})]$ , $K_{s}=\Phi_{s}^{\top}\Phi_{s}$ , $\widetilde{K}_{s}=K_{s}+\lambda\textbf{I}$ , $\mathcal{L}_{s}=\Psi_{s}^{\top}\Psi_{s}$ . Other variables relating to subscribe $t$ are defined in a similar way on dataset $\mathcal{D}_{X\hat{Y}}^{t}$ . By minimizing this CMMD-based loss function, the difference between $P(Y|X)$ and $P(\widehat{Y}|X)$ is reduced to a reasonable range. In particular, according to Theorem 3 shown in [22], when CMMD approaches its minimum value of 0, the predicted distribution converges to the true distribution. It means that the CMMD criterion can be used to learn the category distribution.

However, CMMD fails to measure the similarity between two conditional distributions if the kernel function is selected improperly. It is well known that the performance of MMD is limited by the quality of the kernel function [35, 36]. When dealing with complex distributions, it is difficult to effectively represent similarity between samples using the RBF kernel and the Laplacian kernel, which lead to the degradation of MMD-based algorithms. The same problem exists in the CMMD criterion. We present a demonstrative instance here. We randomly select a sample batch and plot the heat map of the matrix $\tilde{K}^{-1}K\tilde{K}^{-1}$ , which is the weight matrix of $\mathcal{L}$ in Equation (3) and is called H here for convenience. Figure 1 shows the H matrix by using the samples with random sampling. The off-diagonal elements of H are almost the same. Besides, we randomly select a sample batch from one single class, and then plot the heat map of matrix H in Figure 1. There is almost no difference between the two matrices. It indicates that the kernel function used in CMMD criterion fails to distinguish the between-class similarities and the within-class similarities when dealing with complex distributions. The heat maps of H matrix on the SVHN and CIFAR-10 datasets show very similar comparison results, and they are presented in Appendix A.

In summary, kernel function learning module plays an important role in characterizing the loss function $L_{\rm{CMMD}}$ . If the kernel functions are not learned well, then CMMD loses its intrinsic discriminative representation ability. In contrary, well-learned kernel functions are useful to extract robust and discriminative features for classification. Therefore, a better kernel function is expected to improve the classification performance of the CMMD criterion.

III-B Kernel learning on network embedding features

In classification tasks, it is desirable that distances between samples with the same label are small in the learned metric space, while distances between those with different labels are large. However, as shown in Figure 1, when the nonlinear kernel functions are directly applied to input features $\textbf{X}_{s}$ and $\textbf{X}_{t}$ , the RBF kernel function tends to derive a similar Gram matrix for the high-dimensional and normalized data. In this case, CMMD cannot characterize the weights of errors well, thus the classification performance degenerates rapidly.

By Equation (3), the distance $L_{\rm{CMMD}}$ of the two conditional distributions $P(Y|X)$ and $P(\tilde{Y}|X)$ can be estimated by sampling finite samples. Due to sampling bias, $L_{\rm{CMMD}}$ may not be zero (but should be close to 0) when $P(Y|X)=P(\tilde{Y}|X)$ . When $L_{\rm{CMMD}}$ is a large value, the pre-specified kernel function may be inappropriate. Intuitively, if a kernel $K$ gets a small $L_{\rm{CMMD}}$ when $P(Y|X)=P(\tilde{Y}|X)$ , it is likely to describe the similarity better between samples. So, instead of directly using a pre-specified kernel $K$ for the input features X, we expect to obtain a more suitable kernel function via

\min_{K}\|\widehat{C}_{Y|\textbf{X}}^{s}-\widehat{C}_{Y|\textbf{X}}^{t}\|_{\mathcal{F\otimes G}}^{2}.

(4)

In Equation (4), all characteristic kernels should be considered. However, it is difficult to choose the optimal solution from all the characterized kernels. According to the theory proposed by [35], if function $h$ is injective and the kernel function $K$ is characteristic, then the compound kernel function $\mathcal{K}=K\circ h$ , which satisfies to $\mathcal{K}(x,x^{{}^{\prime}})=K(h(x),h(x^{{}^{\prime}}))$ , is still characteristic. In this paper, the optimal solution of Equation (4) is determined via selecting from a series of kernel functions $\mathcal{K}$ . In other words, the kernel function is not applied directly to the raw data $X$ , but instead applied to the embedded feature $Z$ with an $h$ transformation, which is represented by a deep network. Here an injective function $h_{\textbf{w}}$ with parameter w is considered. Let $\textbf{Z}=h_{\textbf{w}}(\textbf{X})$ . We use subscript $\_z$ to denote the features referring to Z and to distinguish them from raw features X. Then the kernel learning loss by the CMMD criterion in the new model can be defined as:

	$\displaystyle L_{\rm{CMMD\_Z}}(\textbf{w})$	(5)
$\displaystyle=$	$\displaystyle\\|\widehat{C}_{Y_{s}\|h_{\textbf{w}}(\textbf{X}_{s})}-\widehat{C}_{Y_{t}\|h_{\textbf{w}}(\textbf{X}_{t})}\\|_{\mathcal{F\otimes G}}^{2}$
$\displaystyle=$	$\displaystyle\\|\widehat{C}_{Y_{s}\|\textbf{Z}_{s}}-\widehat{C}_{Y_{t}\|\textbf{Z}_{t}}\\|_{\mathcal{F\otimes G}}^{2}$
$\displaystyle=$	$\displaystyle\\|\Psi_{s}(K_{s\_z}+\lambda\textbf{I})^{-1}\Phi_{s\_z}^{\top}-\Psi_{t}(K_{t\_z}+\lambda\textbf{I})^{-1}\Phi_{t\_z}^{\top}\\|_{\mathcal{F\otimes G}}^{2}$
$\displaystyle=$	$\displaystyle\mbox{Tr}(K_{s\_z}\widetilde{K}_{s\_z}^{-1}\mathcal{L}_{s}\widetilde{K}_{s\_z}^{-1})+\mbox{Tr}(K_{t\_z}\widetilde{K}_{t\_z}^{-1}\mathcal{L}_{t}\widetilde{K}_{t\_z}^{-1})$
$\displaystyle-$	$\displaystyle 2\cdot\mbox{Tr}(K_{ts\_z}\widetilde{K}_{s\_z}^{-1}\mathcal{L}_{ts}\widetilde{K}_{t\_z}^{-1}),$

in which $\Psi_{s}=[\psi(y_{1}^{s}),\cdots,\psi(y_{n}^{s})]$ , $\mathcal{L}_{s}=\Psi_{s}^{\top}\Psi_{s}$ , $\Phi_{s\_z}=[\phi(\textbf{z}_{1}^{s}),\cdots,\phi(\textbf{z}_{n}^{s})]=[\phi(h_{\textbf{w}}(\textbf{x}_{1}^{s})),\cdots,\phi(h_{\textbf{w}}(\textbf{x}_{n}^{s}))]$ , $K_{s\_z}=\Phi_{s\_z}^{\top}\Phi_{s\_z}$ , and $\widetilde{K}_{s\_z}\!\!=\!\!K_{s\_z}\!+\!\lambda\textbf{I}$ . Other variables relating to subscribe $t$ are defined in a similar way on $\mathcal{D}_{XY}^{t}$ . $K_{ts\_z}=\Phi_{t\_z}^{\top}\Phi_{s\_z}$ , $\mathcal{L}_{st}=\Psi_{s}^{\top}\Psi_{t}$ . By minimizing $L_{\rm{CMMD\_Z}}(\textbf{w})$ , the similarity value of positive pair (i.e., in the same class) becomes larger, while the similarity value of negative pair (i.e., in different classes) becomes smaller.

In addition, to ensure that $h_{\textbf{w}}$ is injective or approximately injective, an auto-encoder (AE) module is designed in KLN. For an injective function $h$ , there exists an inverse function $h^{-1}$ such that $h^{-1}(h(\textbf{x}))=\textbf{x}$ , $\forall\textbf{x}\in\mathcal{X}$ , which can be approximated by an AE. In the following, we use $\textbf{w}=\{\textbf{w}_{e},\textbf{w}_{d}\}$ to denote the network parameters of AE, in which $\textbf{w}_{e}$ denotes the encoder parameters and $\textbf{w}_{d}$ the decoder parameters. Correspondingly, $h_{\textbf{w}_{e}}$ is the encoder and $h_{\textbf{w}_{d}}$ is the decoder. The reconstruction loss of AE can be defined as:

\displaystyle L_{ae}(\textbf{w})=\mathbb{E}_{\textbf{x}\in\mathcal{X}}\|\textbf{x}-h_{\textbf{w}_{d}}(h_{\textbf{w}_{e}}(\textbf{x}))\|^{2}.

(6)

In this AE architecture, decoder satisfies to $h_{\textbf{w}_{d}}\approx h_{\textbf{w}_{e}}^{-1}$ when the reconstruction loss is minimized, which ensures that $h_{\textbf{w}_{e}}$ is approximately injective.

Taking the AE regularization module into account, the objective function of KLN is formulated as

\displaystyle\underset{\textbf{w}}{\min}\>\>L_{\rm{CMMD\_Z}}(\textbf{w}_{e})+\beta L_{ae}(\textbf{w}).

(7)

The first term is the CMMD criterion based on the new kernel $\mathcal{K}=K\circ h_{w_{e}}$ , and the second term is the expected error from the AE, which can be viewed as a regularization term to ensure that the learned kernel function is characteristic. The trade-off positive parameter $\beta$ is predefined by users.

III-C Kernel learning for supervised classification tasks

Algorithm 1 KLN for Supervised Classification Tasks

0: Dataset

\mathcal{D}=\{(\textbf{x}_{i},y_{i})\}_{i=1}^{n}

0: Learned parameters

\textbf{w}=\{\textbf{w}_{e},\textbf{w}_{d}\},\textbf{w}_{fc}

1: Copy

\mathcal{D}

into subsets

\mathcal{D}_{s}

and

\mathcal{D}_{t}

2: while Stopping criterion not met do

3: Draw a mini-batch from

\mathcal{D}_{s}

and

\mathcal{D}_{t}

, and denote them as

\mathcal{B}_{s}

and

\mathcal{B}_{t}

, respectively;

4: For each

\textbf{x}\in\mathcal{B}_{t}

, generate

\hat{y}

; and set

\mathcal{B}^{{}^{\prime}}_{t}

to contain all the generated

(\textbf{x},\hat{y})

;

5: Solve the gradient of CMMD loss w.r.t.

\textbf{w}_{e},\textbf{w}_{fc}

\mathcal{B}_{s}

and

\mathcal{B}^{{}^{\prime}}_{t}

;

6: Solve the gradient of AE loss w.r.t. w on

\mathcal{B}_{s}

and

\mathcal{B}^{{}^{\prime}}_{t}

;

7: Update

\textbf{w},\textbf{w}_{fc}

using the gradient of Eq. (9);

8: end while

In the previous section, KLN is proposed based on the CMMD criterion and the AE framework. In order to deal with supervised classification tasks, a classification layer can be attached onto top of the hidden layer. As shown in Figure 2, a fully connected layer with softmax activation is connected to the hidden layer. The label prediction distribution and the kernel function are learned simultaneously. In particular, the CMMD criterion is adopted to measure prediction errors. The CMMD between the prediction distribution $P(\widehat{Y}|X)$ and the true distribution $P(Y|X)$ can be estimated as:

\displaystyle L_{\rm{CMMD\_Z}}(\widehat{Y})=\|\widehat{C}_{Y_{s}|h_{\textbf{w}_{e}}(\textbf{X}_{s})}-\widehat{C}_{\widehat{Y}_{t}|h_{\textbf{w}_{e}}(\textbf{X}_{t})}\|_{\mathcal{F\otimes G}}^{2}.

(8)

Comparing Equations (5) and (8), $L_{\rm{CMMD\_Z}}(\widehat{Y})$ is an effective approximation of $L_{\rm{CMMD\_Z}}(\textbf{w}_{e})$ when $P(\widehat{Y}|X)\approx P(Y|X)$ . In order to reduce the computational complexity, $L_{\rm{CMMD\_Z}}(\widehat{Y})$ , instead of $L_{\rm{CMMD\_Z}}(\textbf{w}_{e})$ , is used to learn the kernel function. In this case, the CMMD distance is calculated only once. Let $\textbf{w}_{fc}$ represent the network parameters of the fully connected layer, the objective of KLN for supervised classification tasks is modified as follows,

\displaystyle\underset{\textbf{w},\textbf{w}_{fc}}{\min}\>\>L_{\rm{CMMD\_Z}}(\widehat{Y})+\beta L_{ae}(\textbf{w}).

(9)

We call the first term as CMMD loss, and the next term as AE regularization term. In Equation (9), the CMMD loss is used to learn the label prediction distribution and the kernel function simultaneously, and the AE regularization term is used to ensure that the learned kernel function is characteristic.

Figure 2 shows the flowchart of our KLN method. KLN uses the encoder to map the high-dimensional raw data X to the latent feature Z, which can be passed through the decoder to reconstruct X. To predict label $Y$ , the latent variable Z is propagated to the loss layer via a non-linear transformation with FC-layers. Main steps of KLN for supervised classification tasks are summarized and then shown in Algorithm 1.

When the training process shown in Algorithm 1 is completed, we can obtain the prediction distribution $P(\hat{Y}|X)$ , which is represented by the encoder of AE and the fully connected layer. When testing a new sample $\textbf{x}_{test}$ , it is first input to the encoder to get the latent feature $\textbf{z}_{test}$ , which is then input to the fully connected layer to get the predicted label $\hat{\textbf{y}}_{test}$ .

III-D Kernel learning for semi-supervised classification tasks

We now consider adapting the formulation from Section III-C to the semi-supervised setting. Let $\textbf{X}=[\textbf{x}_{1},\cdots,\textbf{x}_{N}]$ be a set of all $N$ examples and $\textbf{X}^{l}=[\textbf{x}_{1}^{l},\cdots,\textbf{x}_{L}^{l}]$ be a set of $L$ labeled examples. In particular, $\textbf{X}^{l}\subset\textbf{X}$ . For every $\textbf{x}_{i}\in\textbf{X}^{l}$ , the ground-truth label $y_{i}\in{1,\cdots,C}$ is known, where $C$ is the number of classes.

In Equation (8), $\textbf{X}_{s}$ needs its true label and $\textbf{X}_{t}$ can use the predicted label, so $\textbf{X}_{s}$ can be sampled from $\textbf{X}^{l}$ and $\textbf{X}_{t}$ from X to estimate the CMMD loss. It means that in semi-supervised classification tasks, the CMMD loss can simultaneously use a small amount of labeled data and a large amount of unlabeled data to learn a more suitable kernel function. In addition, to make better use of the unlabeled data, we also introduce the confidence loss [27] for these data, which is commonly used in the semi-supervised learning models. The confidence loss $L_{con}$ minimizes the conditional entropy of $P(\widehat{Y}|\textbf{X})$ and the cross entropy between $P(Y)$ and $P(\widehat{Y})$ . Accordingly, the objective of KLN for semi-supervised classification tasks is presented as follows,

\displaystyle\underset{\textbf{w},\textbf{w}_{fc}}{\min}\>\>L_{\rm{CMMD\_Z}}(\widehat{Y})+\beta_{1}L_{ae}(\textbf{w})+\beta_{2}L_{con}(\textbf{w}_{e},\textbf{w}_{fc}),

(10)

where $\beta_{1}$ and $\beta_{2}$ are trade-off parameters. Comparing with the training process for supervised tasks, the optimization process and the sampling process are different. Main steps of KLN for semi-supervised classification tasks are summarized and then shown in Algorithm 2.

Algorithm 2 KLN for Semi-Supervised Classification Tasks

0: Datasets

\mathcal{D}=\{(\textbf{x}_{i})\}_{i=1}^{N}

\mathcal{D}^{l}=\{(\textbf{x}_{i}^{l},y_{i}^{l})\}_{i=1}^{L}

0: Learned parameters

\textbf{w}=\{\textbf{w}_{e},\textbf{w}_{d}\},\textbf{w}_{fc}

1: while Stopping criterion not met do

2: Draw a mini-batch from

\mathcal{D}^{l}

and

\mathcal{D}

, and denote them as

\mathcal{B}_{s}

and

\mathcal{B}_{t}

, respectively;

3: For each

\textbf{x}\in\mathcal{B}_{t}

, generate

\hat{y}

; and set

\mathcal{B}^{{}^{\prime}}_{t}

to contain all the generated

(\textbf{x},\hat{y})

;

4: Solve the gradients of CMMD loss and AE regularization term w.r.t.

\textbf{w},\textbf{w}_{fc}

\mathcal{B}_{s}

and

\mathcal{B}^{{}^{\prime}}_{t}

;

5: Solve the gradient of Confidence loss w.r.t.

\textbf{w}_{e},\textbf{w}_{fc}

\mathcal{B}^{{}^{\prime}}_{t}

;

6: Update

\textbf{w},\textbf{w}_{fc}

using the gradient of Eq. (10);

7: end while

KLN learns discriminative features in an end-to-end manner. The loss functions (9) and (10) can be optimized by the Back-Propagation algorithm, which can be implemented automatically and efficiently by several deep learning frameworks including TensorFlow and PyTorch. Thus, the derivative details are omitted here. More implementation details will be discussed in the next section.

IV Experiment results and analysis

In this section, classification performance of KLN was evaluated from several views. The accuracy was firstly tested and compared with other methods in supervised classification tasks. Then several ablation experiments were conducted to show the importance of kernel learning. The histograms of kernel values learned by KLN are also shown. Finally, the classification results of KLN in semi-supervised classification tasks were tested and compared with other methods.

IV-A Datasets and experimental setup

Four benchmark datasets, i.e., MNIST [37], SVHN [38], CIFAR-10 and CIFAR-100 [39], were used for evaluating the image classification performance in this section. Some exemplar instances are shown in Figure 3.

MNIST handwritten digits set is one of the most popular datasets used in the deep learning literature. It has 70,000 images of ten classes (from 0 to 9). Each sample is a 28 $\times$ 28 gray image. The whole dataset was divided into three non-overlapping parts: one training set of 50,000 samples, one validation set of 10,000 samples, and one test set of 10,000 samples.

SVHN is a more complex dataset than MNIST as it has more samples and is closer to real scenes. Each sample is a $3\times 32\times 32$ color image. All the images were captured from house numbers in Google street views. The whole set is also composed of three different subsets, namely, one training set consisting of 73,257 samples, one testing set consisting of 26,032 samples, and one extra training set consisting of 531,131 samples.

CIFAR-10 is another image dataset. It consists of ten-class images including airplanes, cars, cats, and so on. Each class has 6,000 color images of size $3\times 32\times 32$ . The whole set was divided into two separate parts, i.e., one training set of 50,000 images, and one test set of 10,000 images.

CIFAR-100 contains 60,000 color images of size $3\times 32\times 32$ drawn from 100 classes. The dataset was split into 50,000 training images and 10,000 test images.

All experiments were implemented by PyTorch Toolbox, and ran on a PC equipped with a NVIDIA TiTan X GPU and 32G RAM.

Data processing: For MNIST, no preprocessing was applied to the data except for scaling to the numerical range $[0,1]$ . SVHN and CIFAR datasets require more techniques for image preprocessing. In addition to the global normalization, a series of data augmentation methods including random cropping and horizontal flipping were applied.

Network architecture: On the MNIST dataset, we followed the architecture of DCGAN [25] to design the network architecture of KLN. We replaced the pooling layers with strided convolutions, and used batch-wise normalization and LeakyReLU activation in the auto-encoder. On the SVHN, CIFAR-10 and CIFAR-100 datasets, we followed the architecture in [40]. More specific parameter settings are placed in Appendix B.

Hyper-parameters: For model optimization, the batch size was set to 100 for all methods. For supervised classification tasks, SGD algorithm with initial learning rate 0.02, momentum 0.9 and weight decay 0.0005 was used. On CIFAR and MNIST, the learning rate reduces by 0.2 at 50, 100, 130 epochs, and all the networks were trained by a total of 150 epochs. On SVHN, the learning rate reduced by 0.2 at 10, 20, 30 epochs, and the training procedure finished in 40 epoches. For semi-supervised classification tasks, ADAM algorithm [41] with learning rate 1 $e$ -3 and the two moment terms $\gamma_{1}=0.9$ , $\gamma_{2}=0.99$ was used. The dimensionality $d$ of the latent variable Z is $d=128$ . The CMMD loss was calculated by using multiple Gaussian kernel functions with bandwidth parameters [1, 3, 5, 7, 9]. For the weight hyper-parameters, we simply set $\beta=0.1$ in supervised classification tasks and $\beta_{1}=0.1,\beta_{2}=1$ in semi-supervised classification tasks.

IV-B Experiment results of supervised classification

This section compares KLN with some state-of-the-art supervised classification methods, which include VA+Pegasos [42], MMVA [42], Stochastic Pooling [43], Network in Network (NIN) [6], Maxout Network [44], DSN [45], ResNet [46] and so on. For a fair comparison, the same network architecture was used to evaluate the performance of the most recent method CGMMN-CNN. For the rest methods, the best results are from the original work.

Table I shows the classification results of MNIST. KLN was trained with a total of 150 epochs. The encoder and decoder of AE used a 4-layer fully convolutional network and a 4-layer de-convolutional network, respectively. The results show that KLN is competitive with various state-of-the-art competitors, e.g., CMMVA and DSN. DSN obtains the minimum error rate as it benefits from using more supervised information in every hidden layer, and KLN gets the same classification performance as DSN. In particular, we pay our main attention on comparing the results of CGMMN and KLN. Two types of architectures of CGMMN, as stated in [22], i.e., CGMMN-MLP and CGMMN-CNN, are evaluated and then analyzed here. The results of CGMMN-MLP and CGMMN-CNN are 0.97% and 0.45%, respectively, while the accuracy of KLN is 0.39%. Thus, KLN outperforms CGMMN-MLP and CGMMN-CNN.

TABLE I: Comparison with state-of-the-art methods on MNIST.

Method	Error rate (%)
VA+Pegasos [42]	1.04
MMVA [42]	0.90
CVA + Pegasos [42]	1.35
Stochastic Pooling [43]	0.47
NIN [6]	0.47
Maxout Network [44]	0.45
CMMVA [42]	0.45
DSN [45]	0.39
CGMMN-MLP [22]	0.97
CGMMN-CNN [22]	0.45
KLN	0.39

Table II shows the classification results of SVHN. KLN was trained with a total of 40 epochs. The encoder and decoder of AE used a 9-layer convolutional network and a 4-layer de-convolutional network, respectively. The results of CNN and CMMVA are 4.9% and 3.09%, respectively. DSN still obtains a lower error rate as it benefits from using more supervised information in every hidden layer. The results of DSN, CGMMN-CNN and KLN are 1.92%, 2.01% and 1.56%, respectively. Thus, KLN outperforms both DSN and CGMMN, and the effectiveness of algorithmic improvement is validated again.

TABLE II: Comparison with state-of-the-art methods on SVHN.

Method	Error rate (%)
CVA+Pegasos [42]	25.3
CNN [47]	4.90
CMMVA [42]	3.09
Stochastic Pooling [43]	2.80
NIN [6]	2.47
Maxout Network [44]	2.35
DSN [45]	1.92
CGMMN-CNN [22]	2.01
KLN	1.56

Table III shows the classification results of CIFAR-10. KLN was trained with a total of 150 epochs. The encoder and decoder of AE used a 9-layer convolutional network and a 4-layer de-convolutional network, respectively. All the error rates of CNN, NIN, Maxout Network, and DSN are larger than 7.0%. The results of FitNet4-LSUV, ResNet-110 and CGMMN-CNN are 6.06%, 6.43% and 6.61%, respectively, while that of KLN is 5.15%. Therefore, KLN outperforms both DSN and CGMMN-CNN on the CIFAR-10 dataset, and the effectiveness of algorithmic improvement is validated again.

TABLE III: Comparison with state-of-the-art methods on CIFAR10.

Method	Error rate (%)
ALL-CNN [48]	7.25
NIN [6]	8.81
Maxout Network [44]	9.38
DSN [45]	7.97
FitNet4-LSUV [49]	6.06
ResNet-110 [46]	6.43
CGMMN-CNN [22]	6.61
KLN	5.15

Table IV shows the results of CIFAR-100. KLN was trained with the same parameter settings as those of CIFAR-10. All the error rates of CNN, NIN, Maxout Network, and DSN are larger than 30.0%. The results of FitNet4-LSUV, ResNet-110 and CGMMN-CNN are 27.66%, 25.16% and 24.84%, respectively, while that of KLN is 22.63%. Compared with the results of CGMMN method, KLN has more than two percent increments.

TABLE IV: Comparison with state-of-the-art methods on CIFAR100.

Method	Error rate (%)
ALL-CNN [48]	33.71
NIN [6]	35.68
Maxout Network [44]	34.54
DSN [45]	34.57
FitNet4-LSUV [49]	27.66
ResNet-110 [46]	25.16
CGMMN-CNN [22]	24.84
KLN	22.63

In Equation (9), the weight hyper-parameter $\beta$ is introduced to control the weight of the AE regularization term. We further evaluated the susceptibility of $\beta$ . Several optional $\beta$ values were pre-specified in the set $\{10,1,0.1,0.01,0.001\}$ . Ablation experiments were conducted to understand real impact of the AE regularization term, which corresponds to $\beta=0$ . Figure 4 shows the classification results on SVHN and CIFAR-10 for demonstration. The results of KLN are insensitive to the value of $\beta$ , and the model achieves slightly better performance when $\beta$ is set to 0.1. It is inferred that KLN usually learns the injective mapping with high probability when learning a more discriminative kernel function and parameterizing with deep neural networks.

TABLE V: TRAINING TIME (s) OF CGMMN AND KLN.

Method	MNIST	SVHN	CIFAR-10	CIFAR-100
CGMMN	26.22	453.90	37.28	43.87
KLN	62.68	907.32	75.20	81.33

The training time of KLN and CGMMN in one epoch is shown in Table V. In principle, the optimization procedure of CMMD contains two separate but related modules, one from the output layer to the latent layer, and the other from the latent layer to the input layer. Both of them can be learned efficiently by the standard Back-Propagation process. Compared with CGMMN, KLN has an additional AE module. Fortunately, the decoder of the AE is mainly composed of de-convolutional operations, which correspond to the convolutional layers used by the encoder. In other words, the AE does not take much time in optimization. It is worth noting that two different subsets were used in each epoch in KLN, while just one subset was used by CGMMN. Thus, KLN costs twice time as much as CGMMN. All the results of the above experiments validate this.

IV-C Importance of kernels learning

In order to evaluate the importance of kernel learning module in KLN, we set up three groups of controlled experiments, which corresponded to three commonly used mapping functions, i.e., identity mapping, PCA and AE respectively, to construct new kernel functions.

TABLE VI: Prediction accuracy (%) of CMMD with different functions

h

on three benchmark datasets.

Dataset	Identity Mapping	PCA+CMMD	AE+CMMD	KLN
MNIST	81.62	88.64	89.32	99.61
SVHN	19.59	30.23	35.23	98.44
CIFAR-10	14.25	27.75	33.12	94.85

A new kernel $\mathcal{K}=k\circ h$ , which composes of a mixture of Gaussian kernels and different embedding functions $h$ , is considered in this section. When a kernel function is composed with the identity mapping, it means that the kernel matrix is calculated directly from the input data. For the PCA+CMMD and AE+CMMD methods, PCA and AE should be trained additionally, and then the PCA subspace and the encoder of AE are used as the embedding function separately. The critical difference between these methods and KLN is the network learning manner. KLN needs to learn an appropriate kernel function by simultaneously learning a classification model in an end-to-end manner. For a fair comparison, the same network architecture was used to implement these experiments on three datasets. Classification results evaluated on the test sets are shown in Table VI.

The classification accuracies of KLN on the three datasets are 99.61%, 98.44%, and 94.85%, respectively. As is shown above, these results are comparable to and even better than those of the state-of-the-art methods. In contrary, corresponding accuracies of using identity mapping are just 81.62%, 19.59% and 14.25%. Obviously, results of KLN are far better than those obtained from identity mapping. It indicates that KLN characterizes the similarity well. The classification accuracies of PCA+CMMD and AE+CMMD on the MNIST are 88.64% and 89.32%, respectively, while there are only about 30% classification accuracies on more complex datasets. It indicates the importance of learning a more effective kernel under complex data distributions.

IV-D Separating ability of kernels

In Section III-A, we have shown that distributions of $\langle\phi(\textbf{x}_{i}),\phi(\textbf{x}_{j})\rangle$ on samples with the same label or with different labels have no significant difference, thus, the kernel functions used in the CMMD criterion cannot effectively represent similarity. In this paper, a new approach is proposed to extract features via a different kernel utilization manner. Specifically, the new kernel $\mathcal{K}=k\circ h$ , in which $k$ is a pre-specified characteristic kernel function and $h$ is the encoder of an auto-encoder network, is exploited in our method. It is expected that the distributions of $\langle\phi(\textbf{z}_{i}),\phi(\textbf{z}_{j})\rangle$ can be well distinguished.

TABLE VII: Predictive error rates (

\%

) on MNIST, SVHN and CIFAR-10 datasets with

n

labeled examples, averages of 10 runs.

Algorithm	MNIST( $n=100$ )	SVHN( $n=1000$ )	CIFAR-10( $n=4000$ )
DGN [50]	3.33 $\pm 0.14$	36.02 $\pm 0.10$	-
CatGAN [27]	1.39 $\pm 0.28$	-	19.58 $\pm 0.58$
Ladder network [51]	1.06 $\pm 0.37$	-	20.40 $\pm 0.47$
ADGM [52]	0.96 $\pm 0.02$	22.86	-
SDGM [52]	1.32 $\pm 0.07$	16.61 $\pm 0.24$	-
Improved-GAN [53]	0.93 $\pm 0.07$	8.11 $\pm 1.3$	18.63 $\pm 2.32$
Triple-GAN [54]	0.91 $\pm 0.58$	5.77 $\pm 0.17$	16.99 $\pm 0.36$
KLN	0.89 $\pm$ 0.05	5.51 $\pm$ 0.25	16.79 $\pm$ 0.17

We conducted experiments on three datasets, and then presented histograms of kernel values with the same or different labels in Figure 5. Taking the MNIST dataset for example, Figure 5(a) shows histogram for the samples with different labels while Figure 5(d) shows histogram for the samples with the same label. It can be seen that the kernel values, i.e., $\langle\phi(\textbf{z}_{i}),\phi(\textbf{z}_{j})\rangle$ of samples with different labels, are mainly concentrated in the range of 0.1-0.3, while those with the same label are concentrated between 0.7 and 1.0. There is a clear distinction between these two distributions. It indicates that the kernel values learned by KLN have better separating ability. Thus, the effectiveness of KLN in learning more expressive kernel functions is validated.

Results on SVHN and CIFAR-10 present a similar performance. In particular, for SVHN, the kernel values of samples with different labels are mainly concentrated on a sharp neighborhood of 0.0, while those with the same label are concentrated on the range of 0.6-0.9. For CIFAR-10, the kernel values of samples with different labels are mainly concentrated on a sharp neighborhood of 0.1, while those with the same label are concentrated on the range of 0.6-0.9.

We also conducted a group of contrastive experiments to display the histograms of kernel values computed by using the raw input features. The histograms are shown in Appendix C. Specifically, Figures 9, 10 and 11 display the histograms of kernel values on the MNIST, SVHN and CIFAR-10, respectively. All these results indicate that the kernel values, which are directly computed by using the raw features, have little discriminative power in practice.

IV-E Experiment results of semi-supervised classification

This section evaluates KLN on semi-supervised classification tasks on MNIST, SVHN and CIFAR-10, and compares the performance with some state-of-the-art semi-supervised methods, including DGN [50], CatGAN [27], Ladder network [51], ADGM [52], SDGM [52], Improved-GAN [53], Triple-GAN [54].

Following the widely used settings, we used 100, 1000 and 4000 labeled samples on these datasets, respectively. The data preprocessing approaches and network architectures are the same with those used in the supervised learning scenario.

We repeated KLN over 10 runs with different random initializations and random subsets of labeled data, and calculated the mean error rates with the standard deviations following [53]. For a fair comparison, all the results of the baselines are cited from original papers. Table VII summarizes the quantitative results. KLN achieves the state-of-the-art predictive accuracies on all these datasets. Note that for a fair comparison with other algorithms, unlike the supervised learning scenario, the extra unlabeled data on SVHN were not used in training.

KLN has two important hyper-parameters, i.e., $\beta_{1}$ and $\beta_{2}$ , in the semi-supervised learning algorithm. We further evaluated the susceptibility of KLN to these hyper-parameters. These two trade-off parameters were selected from sets $\{10,1,0.1,0.01\}$ . Specifically, ablation experiments were conducted to understand the impact of different components in Equation (10), which corresponds to $\beta_{1}=0$ and $\beta_{2}=0$ , respectively. Figure 6 shows the experiment results on SVHN and CIFAR-10. Parameter $\beta_{1}$ is the weight of AE regularization term, and the results shown in Figure 6(a) indicate that KLN is insensitive to the value of $\beta_{1}$ . The same phenomenon exists in supervised classification tasks. It can be inferred that KLN usually learns the injective mapping with high probability while learning a more discriminative kernel function and parameterizing with deep neural networks. Parameter $\beta_{2}$ is the weight of the confidence loss and it could impact the performance of semi-supervised classification significantly. The comparison results are shown in Figure 6(b). The best choice of $\beta_{2}$ is 1. When the confidence loss was removed, i.e., $\beta_{2}=0$ , the classification performance dropped but still had competitive results with [53]. It indicates the advantages of KLN and importance of the confidence loss.

V Conclusion

In this paper, image classification methods with the CMMD criterion, which measures difference between two conditional distributions, are studied in depth. It is observed that the classification performance based on CMMD tends to degenerate when the kernel function lacks expressiveness. It means that the kernel functions play an important role in deciding whether classification performance degeneration emerges or not.

The KLN method is proposed to tackle these problems. On one hand, KLN can learn more discriminative kernel functions by minimizing the CMMD criterion and the AE regularization term. On the other hand, KLN can learn kernel functions and label prediction function simultaneously, thus it can be used to deal with classification tasks effectively. KLN is evaluated on four benchmark databases. The results show that it can improve the separating ability of kernel similarities and achieve state-of-the-art performance in both supervised and semi-supervised classification tasks. How to extend KLN to deal with domain adaptation problems is our future work.

Appendix A Heat Maps of Matrix H

Figures 7 and 8 are the heat maps of matrix H on the SVHN and CIFAR-10 dataset, respectively. The conclusions obtained by these four figures are consistent with the MNIST datasets. Moreover, there is no significant difference between matrices H for samples of different labels and for samples with the same label.

Appendix B Parameter Settings of Models

The following Table relates to the network structure of KLN in our experiment. The network structure of CGMMN is the same as that of the encoder part of KLN for a fair comparison.

The network architecture of encoder was set according to the complexity of the dataset. For the MNIST dataset, the fully convolutional network architecture was used in both encoder and decoder; while for the SVHN and CIFAR-10 datasets, dropout layers and pooling layers were used on the encoder to improve the generalization of the network.

TABLE VIII: The network architecture used on MNIST.

Image size

\times

\times

Encoder

4\times 4

conv, chanel 64, stride 2, padding 1

leakyReLU

4\times 4

conv, chanel 128, stride 2, padding 1

batch norm, leakyReLU

7\times 7

conv, chanel 1024, stride 1

batch norm, leakyReLU

1\times 1

conv, chanel 128, stride 1

batch norm, leakyReLU

Dimension of Z

128

Decoder

1\times 1

deconv, chanel 1024, stride 1

batch norm, ReLU

7\times 7

deconv, chanel 128, stride 1

batch norm, ReLU

4\times 4

deconv, chanel 64, stride 2, padding 1

batch norm, ReLU

4\times 4

deconv, chanel 1, stride 2, padding 1

sigmoid

\sigma^{2}

{1,3,5,7,9}

Appendix C Histogram of the kernel similarities

Figures 9, 10 and 11 display the histograms of kernel values computed by using the raw input features, on the MNIST, SVHN and CIFAR-10, respectively. We can see that the histogram distributions of between-classes (the left diagram) and within-classes (the right diagram) have no significant difference.

References

[1] Q. Wang, J. Gao, and Y. Yuan, “A joint convolutional neural networks and context transfer for street scenes labeling,” IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 5, pp. 1457–1470, 2018.
[2] Q. Wang, S. Liu, J. Chanussot, and X. Li, “Scene classification with recurrent attention of VHR remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 2, pp. 1155–1167, 2019.
[3] A. Smola, A. Gretton, L. Song, and B. Schölkopf, “A Hilbert space embedding for distributions,” in International Conference on Algorithmic Learning Theory. Springer, 2007, pp. 13–31.
[4] W. Wang, R. Wang, S. Shan, and X. Chen, “Discriminative covariance oriented representation learning for face recognition with image sets,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5749–5758.
[5] C. X. Ren, Z. Lei, D. Q. Dai, and S. Z. Li, “Enhanced local gradient order features and discriminant analysis for face recognition,” IEEE Transactions on Cybernetics, vol. 46, no. 11, pp. 2656–2669, 2016.
[6] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
[7] M. Herbster and M. Pontil, “Prediction on a graph with a perceptron,” in Advances on Neural Information Processing Systems, 2007, pp. 577–584.
[8] Y. Feng, Y. Yang, X. Huang, S. Mehrkanoon, and J. A. Suykens, “Robust support vector machines for classification with nonconvex and smooth losses,” Neural Computation, vol. 28, no. 6, pp. 1217–1247, 2016.
[9] D. M. Witten and R. Tibshirani, “Penalized classification using Fisher’s linear discriminant,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 73, no. 5, pp. 753–772, 2011.
[10] C. X. Ren, D. Q. Dai, X. He, and H. Yan, “Sample weighting: An inherent approach for outlier suppressing discriminant analysis,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 11, pp. 3070–3083, 2015.
[11] C. X. Ren, D. Q. Dai, X. X. Li, and Z. R. Lai, “Band-reweighted gabor kernel embedding for face image representation and recognition,” IEEE Transactions on Image Processing, vol. 23, no. 2, pp. 725–740, 2014.
[12] Y. F. Yu, C. X. Ren, D. Q. Dai, and K. Huang, “Kernel embedding multi-orientation local pattern for image representation,” IEEE Transactions on Cybernetics, vol. 48, no. 4, pp. 1124–1135, 2018.
[13] B. Sriperumbudur, A. Gretton, K. Fukumizu, and B. S. Gert Lanckriet, “Injective Hilbert space embeddings of probability measures,” in Annual Conference on Learning Theory, 2008, pp. 111–122.
[14] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf, “Kernel measures of conditional dependence,” in Advances on Neural Information Processing Systems, 2008, pp. 489–496.
[15] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola, “A kernel method for the two-sample-problem,” in Advances on Neural Information Processing Systems, 2007, pp. 513–520.
[16] Y. Li, K. Swersky, and R. Zemel, “Generative moment matching networks,” in International Conference on Machine Learning, 2015, pp. 1718–1727.
[17] C. Li, W. Chang, C. Yu et al., “MMD GAN: Towards deeper understanding of moment matching network,” in Advances on Neural Information Processing Systems, 2017, pp. 2200–2210.
[18] B. Sun, J. Feng, and K. Saenko, “Return of frustratingly easy domain adaptation,” in American Association for Artificial Intelligence, 2016, pp. 2058–2065.
[19] C. X. Ren, X. L. Xu, and H. Yan, “Generalized conditional domain adaptation: A causal perspective with low-rank translators,” IEEE Transactions on Cybernetics, pp. 1–14, 2018, DOI: 10.1109/TCYB.2018.2874219, Accepted.
[20] H. Yan, Y. Ding, P. Li, Q. Wang, Y. Xu, and W. Zuo, “Mind the class weight bias: weighted maximum mean discrepancy for unsupervised domain adaptation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2272–2281.
[21] K. Fukumizu, A. Gretton, G. R. Lanckriet, B. Schölkopf, and B. K. Sriperumbudur, “Kernel choice and classifiability for RKHS embeddings of probability distributions,” in Advances on Neural Information Processing Systems, 2009, pp. 1750–1758.
[22] Y. Ren, J. Li, Y. Luo, and J. Zhu, “Conditional generative moment-matching networks,” in Advances on Neural Information Processing Systems, 2016, pp. 2928–2936.
[23] M. Long, Y. Cao, J. Wang et al., “Learning transferable features with deep adaptation networks,” in International Conference on Machine Learning, 2015, pp. 97–105.
[24] T. Hinz and S. Wermter, “Image generation and translation with disentangled representations,” in International Joint Conference on Neural Networks (IJCNN), 2018, pp. 1–8.
[25] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arVix preprint, arXiv:1511.06434, 2015.
[26] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2016.
[27] J. T. Springenberg, “Unsupervised and semi-supervised learning with categorical generative adversarial networks,” in International Conference on Learning Representations, 2016, pp. 1–20.
[28] Q. Wang, Y. Yuan, and X. Li, “GETNET: A general end-to-end two-dimensional CNN framework for hyperspectral image change detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 1, pp. 3–13, 2019.
[29] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza et al., “Generative adversarial nets,” in Advances on Neural Information Processing Systems, 2014, pp. 2672–2680.
[30] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative domain adaptation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2962–2971.
[31] E. L. Denton, S. Chintala, R. Fergus et al., “Deep generative image models using a laplacian pyramid of adversarial networks,” in Advances on Neural Information Processing Systems, 2015, pp. 1486–1494.
[32] A. Y. Ng and M. I. Jordan, “On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes,” in Advances on Neural Information Processing Systems, 2002, pp. 841–848.
[33] C. R. Baker, “Joint measures and cross-covariance operators,” Transactions of the American Mathematical Society, vol. 186, no. 186, pp. 273–289, 1973.
[34] L. Song, J. Huang, A. Smola, and K. Fukumizu, “Hilbert space embeddings of conditional distributions with applications to dynamical systems,” in International Conference on Machine Learning. ACM, 2009, pp. 961–968.
[35] A. Gretton, B. Sriperumbudur, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, and F. Kenji, “Optimal kernel choice for large-scale two-sample tests,” in Advances on Neural Information Processing Systems, 2012, pp. 1205–1213.
[36] K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang, “Domain adaptation under target and conditional shift,” in International Conference on Machine Learning, 2013, pp. 819–827.
[37] L. Yann, B. Leon, B. Yoshua, and H. Patrick, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[38] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in NIPS workshop on deep learning and unsupervised feature learning, vol. 2011, no. 2, 2011, p. 5.
[39] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Citeseer, 2009.
[40] S. Laine and T. Aila, “Temporal ensembling for semi-supervised learning,” arXiv preprint arXiv:1610.02242, 2016.
[41] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[42] C. Li, J. Zhu, T. Shi, and B. Zhang, “Max-margin deep generative models,” in Advances on Neural Information Processing Systems, 2015, pp. 1837–1845.
[43] M. D. Zeiler and R. Fergus, “Stochastic pooling for regularization of deep convolutional neural networks,” arXiv preprint arXiv:1301.3557, 2013.
[44] I. Goodfellow, D. Wardefarley, M. Mirza et al., “Maxout networks,” in International Conference on Machine Learning, 2013, pp. 1319–1327.
[45] C. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, “Deeply-supervised nets,” in International Conference on Artificial Intelligence and Statistics, 2015, pp. 562–570.
[46] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[47] P. Sermanet, S. Chintala, and Y. LeCun, “Convolutional neural networks applied to house numbers digit classification,” in International Conference on Pattern Recognition, 2012, pp. 3288–3291.
[48] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for simplicity: The all convolutional net,” arXiv preprint, arXiv:1412.6806, 2014.
[49] D. Mishkin and J. Matas, “All you need is a good init,” arXiv preprint arXiv:1511.06422, 2015.
[50] D. P. Kingma, D. J. Rezende, S. Mohamed, and M. Welling, “Semi-supervised learning with deep generative models,” Advances on Neural Information Processing Systems, vol. 4, pp. 3581–3589, 2014.
[51] A. Rasmus, H. Valpola, M. Honkala, M. Berglund, and T. Raiko, “Semi-supervised learning with ladder networks,” Advances in Neural Information Processing Systems, 2015.
[52] L. Maaløe, C. K. Sønderby, S. K. Sønderby, and O. Winther, “Auxiliary deep generative models,” in International Conference on Machine Learning, 2016, pp. 1445–1453.
[53] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” Advances in Neural Information Processing Systems, 2016.
[54] C. Li, K. Xu, J. Zhu, and B. Zhang, “Triple generative adversarial nets,” Advances in Neural Information Processing Systems, 2017.