This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

End-to-end Kernel Learning
via Generative Random Fourier Features

Kun Fang Department of Automation, Shanghai Jiao Tong University
{fanghenshao,xiaolinhuang,jieyang}@sjtu.edu.cn
Fanghui Liu LIONS Lab, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
[email protected]
Xiaolin Huang Department of Automation, Shanghai Jiao Tong University
{fanghenshao,xiaolinhuang,jieyang}@sjtu.edu.cn
Jie Yang Department of Automation, Shanghai Jiao Tong University
{fanghenshao,xiaolinhuang,jieyang}@sjtu.edu.cn
Abstract

Random Fourier features (RFFs) provide a promising way for kernel learning in a spectral case. Current RFFs-based kernel learning methods usually work in a two-stage way. In the first-stage process, learning an optimal feature map is often formulated as a target alignment problem, which aims to align the learned kernel with a pre-defined target kernel (usually the ideal kernel). In the second-stage process, a linear learner is conducted with respect to the mapped random features. Nevertheless, the pre-defined kernel in target alignment is not necessarily optimal for the generalization of the linear learner. Instead, in this paper, we consider a one-stage process that incorporates the kernel learning and linear learner into a unifying framework. To be specific, a generative network via RFFs is devised to implicitly learn the kernel, followed by a linear classifier parameterized as a full-connected layer. Then the generative network and the classifier are jointly trained by solving an empirical risk minimization (ERM) problem to reach a one-stage solution. This end-to-end scheme naturally allows deeper features, in correspondence to a multi-layer structure, and shows superior generalization performance over the classical two-stage, RFFs-based methods in real-world classification tasks. Moreover, inspired by the randomized resampling mechanism of the proposed method, its enhanced adversarial robustness is investigated and experimentally verified.

1 Introduction

Kernel methods reveal the non-linear property hidden in data and have been extensively studied in recent decades [1, 2]. The selection of the kernel still remains a non-trivial problem, which requires prior knowledge and directly affects the algorithm performance. Hence, learning a suitable kernel from data has been a universal choice [3, 4]. In recent decades, a series of researches have been devoted to exploit the random Fourier features (RFFs) [5] in the kernel learning task [6, 7].

The pioneering work [5] of RFFs constructed an explicit feature map ϕ\phi to approximate a positive definite, shift-invariant kernel. The feature map ϕ\phi maps the input point 𝐱\mathbf{x} to the so-called random Fourier features ϕ(𝐱,𝐰)\phi(\mathbf{x},\mathbf{w}), where the associated weights 𝐰\mathbf{w} are sampled from the spectral distribution of the kernel function. A linear learner is then conducted on the mapped features to categorize them. Intuitively, RFFs allow us to learn the weights 𝐰\mathbf{w} in a spectral sense, i.e., to learn a kernel [6, 7]. To be specific, to involve the data information, typical approaches [6, 7] proposed to solve the following target alignment problem [8] to learn the weights 𝐰\mathbf{w},

argmax𝐰i,jyiyjϕ(𝐱i,𝐰)Tϕ(𝐱j,𝐰),\mathop{\arg\max}_{\mathbf{w}}\ \sum_{i,j}y_{i}y_{j}\phi(\mathbf{x}_{i},\mathbf{w})^{T}\phi(\mathbf{x}_{j},\mathbf{w}), (1)

where (𝐱i,yi)(\mathbf{x}_{i},y_{i}) and (𝐱j,yj)(\mathbf{x}_{j},y_{j}) denote the ii-th and the jj-th training samples respectively. The inner product of the two mapped features ϕ(𝐱i,𝐰)Tϕ(𝐱j,𝐰)\phi(\mathbf{x}_{i},\mathbf{w})^{T}\phi(\mathbf{x}_{j},\mathbf{w}) denotes the implicit kernel function value k(𝐱i,𝐱j)k(\mathbf{x}_{i},\mathbf{x}_{j}). Sinha and Duchi [6] learned an optimal weight subset of 𝐰\mathbf{w} by solving Eq.(1) for an optimal feature subset of the vanilla RFFs, and proved the consistency and generalization bound. The proposed algorithm [6] is efficient and highly scalable. Li et al. [7] incorporated a deep neural network to implicitly learn the kernel, and trained the network by solving Eq.(1). The work in [7] shows that the performance could be improved by involving a network to model the kernel distribution.

These kernel learning methods [6, 7] follow a two-stage scheme: They first learn the random features ϕ(𝐱,𝐰)\phi(\mathbf{x},\mathbf{w}), which is equivalent to learning the weights 𝐰\mathbf{w} since the map ϕ\phi is explicit, by solving Eq.(1), and then learn a linear classifier on the features. The optimization target of Eq.(1) is to align the learned kernel with the pre-defined ideal kernel 𝐲𝐲T\mathbf{y}\mathbf{y}^{T}. However, this commonly-used ideal kernel in target alignment is not necessarily optimal for the generalization of the linear classifier learned in the second stage. Hence, in such a two-stage way, the random features learned in the first stage perhaps show excellent approximation performance towards the ideal kernel, but do not take much care of the generalization or classification performance, which could be further improved.

To address this issue, towards learning a more well-generalized kernel, we propose to jointly learn the random features and the classifier in an end-to-end way by straightly solving the following expectation risk minimization problem,

argminG,C𝔼(𝐱,y)L(C(ϕ(𝐱,𝔼𝐧G(𝐧))),y),\mathop{\arg\min}_{G,C}\ \mathbb{E}_{(\mathbf{x},y)}L\Big{(}C\big{(}\phi\big{(}\mathbf{x},\mathbb{E}_{\mathbf{n}}G(\mathbf{n})\big{)}\big{)},y\Big{)}, (2)

where L(,)L(\cdot,\cdot) denotes the loss function and 𝐧\mathbf{n} denotes the noise random variable. A generative network G()G(\cdot), a.k.a. a generator, is devised to learn the kernel distribution, which takes a set of noises as input and samples weights from the learned kernel distribution. Then, the corresponding random features are constructed based on the generated weights according to the feature map ϕ\phi. A linear classifier C()C(\cdot) parameterized as a full-connected (FC) layer is applied to the random features to categorize them. The generative network and linear classifier are jointly trained by directly minimizing the loss between the true labels and predicted ones. Therefore, we achieve an end-to-end, one-stage solution, which no longer pursues the approximation ability of random features towards any pre-defined kernels. Instead, it is expected that distributions of the underlying kernel can be modeled by the generative network for better classification or generalization performance. The proposed method is called Generative RFFs (GRFFs) since the random features are built via the generative network.

Besides, this end-to-end training strategy naturally allows deeper features, in correspondence to a multi-layer structure of GRFFs. An associated progressively training strategy is designed to empirically guarantee the convergence of the multi-layer GRFFs. The multi-layer structure further boosts the generalization performance, since a good kernel on features is learned in this way. Moreover, the randomized resampling mechanism of GRFFs yields the non-deterministic weights and brings the merit of adversarial robustness gains, which is investigated and experimentally verified.

Our contributions can be summarized as follows,

  • To the best of our knowledge, this is the first work in RFF-based kernel learning algorithms to incorporate the kernel learning and the linear learner into a unifying framework via generative models for an end-to-end and one-stage solution. Empirical results indicate the superior generalization performance over the classical two-stage, RFFs-based methods.

  • The end-to-end strategy enables us to employ a multi-layer structure for GRFFs, indicating a good kernel on deeper features. A progressively training strategy is devised to efficiently train the multiple generators. The multi-layer structure further advances the generalization performance.

  • The adversarial robustness of the proposed method is also investigated. Empirical results show that, to some degree, the randomized resampling mechanism associated with the learned distributions can alleviate the performance decrease against adversarial attacks.

The rest of this paper is organized as follows. We briefly introduce the random Fourier features and the generative models in Section 2. The one-stage framework, multi-layer structure, and progressively training strategy of the proposed method are explained in detail in Section 3. Experiment results on classification tasks and adversarial robustness are shown in Section 4. The discussions and the conclusions are drawn in Section 5.

2 Preliminaries

2.1 Random Fourier Features

In kernel methods, a positive definite kernel k(𝐱,𝐱)k(\mathbf{x},\mathbf{x^{\prime}}) with 𝐱,𝐱d\mathbf{x},\mathbf{x^{\prime}}\in\mathcal{R}^{d} defines a map Φ:d\Phi:\mathcal{R}^{d}\to\mathcal{H}, which satisfies k(𝐱,𝐱)=Φ(𝐱),Φ(𝐱)k(\mathbf{x},\mathbf{x^{\prime}})=\langle\Phi(\mathbf{x}),\Phi(\mathbf{x^{\prime}})\rangle_{\mathcal{H}}, where ,\langle\cdot,\cdot\rangle_{\mathcal{H}} denotes the inner product in a Reproducing Kernel Hilbert Space \mathcal{H}. However, in large-scale kernel machines, there exist unacceptable high computation and memory costs: 𝒪(n2)\mathcal{O}(n^{2}) kernel evaluations, 𝒪(n2)\mathcal{O}(n^{2}) in space, and even 𝒪(n3)\mathcal{O}(n^{3}) in time to compute the inverse of the kernel matrix. Therefore, randomized features are introduced in large-scale cases to approximate the kernel function [5]. The theoretical foundation is based on the Bochner’s theorem [9], which indicates that the Fourier transform of a kernel function is associated to a probability distribution: A continuous and shift-invariant kernel k(𝐱,𝐱)=k(𝐱𝐱)k(\mathbf{x},\mathbf{x^{\prime}})=k(\mathbf{x}-\mathbf{x^{\prime}}) on d\mathcal{R}^{d} is positive definite if and only if k()k(\cdot) is the Fourier transform of a non-negative measure p(𝐰)p(\mathbf{w}). That is,

k(𝐱𝐱)=dp(𝐰)ej𝐰T(𝐱𝐱)d𝐰.k(\mathbf{x}-\mathbf{x^{\prime}})=\int_{\mathcal{R}^{d}}p(\mathbf{w})e^{\mathrm{j}\mathbf{w}^{T}(\mathbf{x}-\mathbf{x^{\prime}})}\mathrm{d}\mathbf{w}. (3)

Considering a real-valued, shift-invariant and positive definite kernel kk, by applying the Euler’s formula ejx=cos(x)+jsin(x)e^{\mathrm{j}x}=\cos(x)+\mathrm{j}\sin(x) to Eq.(3), we have

k(𝐱𝐱)=Re[k(𝐱𝐱)]=dp(𝐰)cos(𝐰T(𝐱𝐱))d𝐰.=dp(𝐰)(cos(𝐰T𝐱)cos(𝐰T𝐱)+sin(𝐰T𝐱)sin(𝐰T𝐱))d𝐰=𝔼𝐰[[cos(𝐰T𝐱)sin(𝐰T𝐱)]T[cos(𝐰T𝐱)sin(𝐰T𝐱)]].\begin{split}k(\mathbf{x}-\mathbf{x^{\prime}})&=\mathrm{Re}[k(\mathbf{x}-\mathbf{x^{\prime}})]=\int_{\mathcal{R}^{d}}p(\mathbf{w})\cos(\mathbf{w}^{T}(\mathbf{x}-\mathbf{x^{\prime}}))\mathrm{d}\mathbf{w}.\\ =&\mathop{\int}\limits_{\mathcal{R}^{d}}p(\mathbf{w})(\cos(\mathbf{w}^{T}\mathbf{x})\cos(\mathbf{w}^{T}\mathbf{x^{\prime}})+\sin(\mathbf{w}^{T}\mathbf{x})\sin(\mathbf{w}^{T}\mathbf{x^{\prime}}))\mathrm{d}\mathbf{w}\\ =&\ \mathbb{E}_{\mathbf{w}}\Big{[}\big{[}\cos(\mathbf{w}^{T}\mathbf{x})\ \sin(\mathbf{w}^{T}\mathbf{x})\big{]}^{T}\big{[}\cos(\mathbf{w}^{T}\mathbf{x^{\prime}})\ \sin(\mathbf{w}^{T}\mathbf{x^{\prime}})\big{]}\Big{]}.\end{split} (4)

Therefore, one can construct an explicit feature map ϕ:d2D\phi:\mathcal{R}^{d}\to\mathcal{R}^{2D} as follows:

ϕ(𝐱,{𝐰i}i=1D)1D[cos(𝐰1T𝐱)cos(𝐰DT𝐱)sin(𝐰1T𝐱)sin(𝐰DT𝐱)],\phi(\mathbf{x},\{\mathbf{w}_{i}\}^{D}_{i=1})\equiv\sqrt{\frac{1}{D}}\big{[}\cos(\mathbf{w}^{T}_{1}\mathbf{x})\cdots\cos(\mathbf{w}^{T}_{D}\mathbf{x})\ \sin(\mathbf{w}^{T}_{1}\mathbf{x})\cdots\sin(\mathbf{w}^{T}_{D}\mathbf{x})\big{]}, (5)

known as the random Fourier features [5]. The set of weights {𝐰i}i=1D\{\mathbf{w}_{i}\}^{D}_{i=1} is sampled from the spectral distribution of the kernel function. The mapped random Fourier features satisfy ϕ(𝐱,{𝐰i}i=1D)Tϕ(𝐱,{𝐰i}i=1D)k(𝐱,𝐱)\phi(\mathbf{x},\{\mathbf{w}_{i}\}^{D}_{i=1})^{T}\phi(\mathbf{x^{\prime}},\{\mathbf{w}_{i}\}^{D}_{i=1})\approx k(\mathbf{x},\mathbf{x^{\prime}}) [5]. A detailed analysis of the convergence can be found in [10].

The vanilla RFFs [5] accelerate the large-scale kernel machines a lot and are data-independent. Afterwards, RFFs have been widely utilized, including the kernel approximation [11], extreme learning machines [12], the bias-variance trade-off in machine learning [13] and the kernel learning task [6, 7]. A systematic survey on RFFs-based algorithms can be found in [14].

Inspired by the vanilla RFFs, learning the features is equal to learning the weights {𝐰i}i=1D\{\mathbf{w}_{i}\}^{D}_{i=1}, while learning the weights is equal to learning the kernel distribution. Hence, to efficiently learn the distribution, generative models are introduced.

2.2 Generative Models

Generative models have been widely applied in learning distributions from data [15]. Various generative models can be categorized into two types: models that perform explicit probability density estimations, and models that perform as a sampler sampling from the estimated distribution without a precise probability function.

Suppose a training set includes samples from a distribution data\mathbb{P}_{data}, then the first type of generative models returns an estimation model\mathbb{P}_{model} of data\mathbb{P}_{data}. Given a particular value as input, the estimation model\mathbb{P}_{model} will output the corresponding probability. While the second type tries to learn the latent data\mathbb{P}_{data} as well, but in a rather different way: The learned model actually simulates a sampling process, through which one can generate more samples from the estimated distribution.

There exist lots of researches in the second type of generative models using a deep neural network as the sampler. In computer vision, the family of generative adversarial networks (GANs) [16] has shown great generality in various situations [17, 18]. The generative network in GANs performs as a useful image generator, which generates images by sampling from the learned image distribution. Besides, in Bayesian deep learning, there is a series of hypernetwork-based works, which use a neural network, called hypernet, to generate the parameters in another neural network, called primary net. The hypernet also performs as a sampler to learn the parameter posterior distribution of the primary net [19, 20].

Therefore, the generative networks are employed in the proposed GRFFs due to the powerful distribution learning ability. The resulting merits are two-fold: On the one hand, the GRFFs could learn a well-generalized kernel distribution from data. On the other hand, the generative networks could produce non-deterministic weights, bringing the adversarial robustness gains.

3 Generative Random Fourier Features Model

3.1 One-stage Generative Random Fourier Features

Previous RFFs-based kernel learning methods construct random features via sampling from the spectral distribution of the kernel [5] or via approximating the ideal kernel [6, 7], and then train a classifier on these features. The classification performance of the random features learned in this two-stage manner can perhaps be further improved. Therefore, we propose generative RFFs to jointly learn the features and the classifier by optimizing the expectation risk minimization problem in an end-to-end manner, which leads to a one-stage solution with better classification performance.

We first describe a general framework of our one-stage model, illustrated in Fig.1. A generator is designed to learn and to sample from the latent kernel distribution. Then, the generative random Fourier features of the original data are constructed by the sampled weights from the learned distribution. Finally, a linear classifier is applied to the features to categorize them.

Refer to caption
Figure 1: An illustration of the one-stage generative random Fourier features.

The generator in our method performs as a sampler. It implicitly learns some distribution k\mathbb{P}_{k} and generates samples from it. Given an arbitrary noise distribution 0\mathbb{P}_{0} and a set of noises 𝐍={𝐧i}i=1D\mathbf{N}=\{\mathbf{n}_{i}\}^{D}_{i=1} sampled from 0\mathbb{P}_{0}, the generator ΦG\Phi_{G} takes them as input and generates the corresponding set of weights {𝐰i}i=1D\{\mathbf{w}_{i}\}^{D}_{i=1} sampled from k\mathbb{P}_{k}:

𝐰i=ΦG(𝐧i),𝐧i0,𝐰ik,i=1,,D.\mathbf{w}_{i}=\Phi_{G}(\mathbf{n}_{i}),\mathbf{n}_{i}\sim\mathbb{P}_{0},\mathbf{w}_{i}\sim\mathbb{P}_{k},i=1,...,D. (6)

Given a training set {(𝐱i,yi)}i=1n\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}, the generative random Fourier features 𝐳i\mathbf{z}_{i} of each training point 𝐱i\mathbf{x}_{i} will be constructed with full use of the DD weights {𝐰j}j=1D\{\mathbf{w}_{j}\}^{D}_{j=1} according to Eq.(5):

𝐳i=ϕ(𝐱i,{𝐰j}j=1D),𝐰jk,i=1,,n.\mathbf{z}_{i}=\phi(\mathbf{x}_{i},\{\mathbf{w}_{j}\}^{D}_{j=1}),\mathbf{w}_{j}\sim\mathbb{P}_{k},i=1,...,n. (7)

Notice that the weight 𝐰j\mathbf{w}_{j} and the data point 𝐱i\mathbf{x}_{i} are of the same dimension and that the dimension of 𝐳i\mathbf{z}_{i} equals to 2D2D. In the end, the linear classifier ΦC\Phi_{C} will be applied on the GRFFs 𝐳i\mathbf{z}_{i} to predict the label y^i=ΦC(𝐳i)\hat{y}_{i}=\Phi_{C}(\mathbf{z}_{i}).

Denote θG\theta_{G} and θC\theta_{C} as the parameters of the generator ΦG\Phi_{G} and the linear classifier ΦC\Phi_{C} respectively. Combining Eq.(6) and Eq.(7), we have the following empirical risk minimization (ERM) problem:

minθG,θC1ni=1nL(ΦC(ϕ(𝐱i,ΦG(𝐍))),yi).\mathop{\min}_{\theta_{G},\theta_{C}}\frac{1}{n}\sum_{i=1}^{n}L\Big{(}\Phi_{C}\big{(}\phi\big{(}\mathbf{x}_{i},\Phi_{G}(\mathbf{N})\big{)}\big{)},y_{i}\Big{)}. (8)

3.2 Multi-layer Generative Random Fourier Features

Refer to caption
Figure 2: An illustration of the multi-layer structure of GRFFs.

The end-to-end training strategy by straightly optimizing the ERM problem naturally allows us to employ a multi-layer structure of GRFFs, shown in Fig.2. To be specific, with one generator, we can build the random features of the original data, which can be viewed as the first layer of features. Then, with another generator, a second layer of random features can be constructed in the same way based on the features from the previous layer. After such a layer-by-layer operation, the random features in the last layer are followed with an FC layer. The total networks, i.e., all the generators and the last FC layer, are again jointly trained to solve the ERM problem. In this way, in each layer, there exists a corresponding generator which models the specific distribution on this layer of features, indicating a good kernel learned on features.

In those two-stage approaches [5, 6, 7], the distribution is associated with a pre-defined kernel, which is restricted to a single-layer structure, since the kernel for the multi-layer case is not clear. By contrast, we cover the multi-layer case, and thus have an advantage in learning a much more linear-separable pattern from data under the guidance of the learned kernels on features.

Without loss of generality, a detailed elaboration on the two-layer structure is presented below as an example, which could be naturally generalized to any number of layers (see Alg.1). Denote ΦG1\Phi_{G_{1}} and ΦG2\Phi_{G_{2}} as the two generators in the first and second layers respectively. The two generators, ΦG1\Phi_{G_{1}} and ΦG2\Phi_{G_{2}}, take two sets of noises 𝐍1={𝐧i1}i=1D1\mathbf{N}_{1}=\{\mathbf{n}_{i}^{1}\}_{i=1}^{D_{1}} and 𝐍2={𝐧i2}i=1D2\mathbf{N}_{2}=\{\mathbf{n}_{i}^{2}\}_{i=1}^{D_{2}} independently sampled from the same distribution 0\mathbb{P}_{0} as input respectively, and generate the corresponding weights 𝐖1={𝐰i1}i=1D1\mathbf{W}_{1}=\{\mathbf{w}_{i}^{1}\}_{i=1}^{D_{1}} and 𝐖2={𝐰i2}i=1D2\mathbf{W}_{2}=\{\mathbf{w}_{i}^{2}\}_{i=1}^{D_{2}}:

𝐰i1=ΦG1(𝐧i1),𝐧i10,𝐰i1k1,i=1,,D1,𝐰i2=ΦG2(𝐧i2),𝐧i20,𝐰i2k2,i=1,,D2,\begin{split}\mathbf{w}_{i}^{1}=\Phi_{G_{1}}(\mathbf{n}_{i}^{1}),\mathbf{n}_{i}^{1}\sim\mathbb{P}_{0},\mathbf{w}_{i}^{1}\sim\mathbb{P}_{k_{1}},i=1,...,D_{1},\\ \mathbf{w}_{i}^{2}=\Phi_{G_{2}}(\mathbf{n}_{i}^{2}),\mathbf{n}_{i}^{2}\sim\mathbb{P}_{0},\mathbf{w}_{i}^{2}\sim\mathbb{P}_{k_{2}},i=1,...,D_{2},\end{split} (9)

where k1\mathbb{P}_{k_{1}} and k2\mathbb{P}_{k_{2}} are the distributions modeled by ΦG1\Phi_{G_{1}} and ΦG2\Phi_{G_{2}} respectively.

In the first layer, a training sample 𝐱i\mathbf{x}_{i} cooperates together with 𝐖1\mathbf{W}_{1} to construct the generative random Fourier feature 𝐳i1\mathbf{z}_{i}^{1} according to Eq.(5):

𝐳i1=ϕ(𝐱i,{𝐰j1}j=1D1),𝐰j1k1,i=1,,n.\mathbf{z}_{i}^{1}=\phi(\mathbf{x}_{i},\{\mathbf{w}_{j}^{1}\}^{D_{1}}_{j=1}),\mathbf{w}_{j}^{1}\sim\mathbb{P}_{k_{1}},i=1,...,n. (10)

Then, the random feature 𝐳i1\mathbf{z}_{i}^{1} cooperates together with the generative weights 𝐖2\mathbf{W}_{2} from the second layer to construct the generative random Fourier feature in the second layer 𝐳i2\mathbf{z}_{i}^{2} in the same way:

𝐳i2=ϕ(𝐳i1,{𝐰j2}j=1D2),𝐰j2k2,i=1,,n.\mathbf{z}_{i}^{2}=\phi(\mathbf{z}_{i}^{1},\{\mathbf{w}_{j}^{2}\}^{D_{2}}_{j=1}),\mathbf{w}_{j}^{2}\sim\mathbb{P}_{k_{2}},i=1,...,n. (11)

Again, notice that 𝐰j1\mathbf{w}^{1}_{j} and 𝐱i\mathbf{x}_{i} are of the same dimension and that the dimension of 𝐳i1\mathbf{z}^{1}_{i} equals to 2D12D_{1}. Accordingly, the dimension of 𝐰j2\mathbf{w}^{2}_{j} equals to 2D12D_{1} and the dimension of 𝐳i2\mathbf{z}^{2}_{i} equals to 2D22D_{2}. Finally, there is a linear classifier ΦC\Phi_{C} applied to the last random feature 𝐳i2\mathbf{z}_{i}^{2} to output the prediction.

Now we rewrite the ERM problem in Eq.(8) for the two-layer GRFFs as follows,

minθG1,θG2,θC1ni=1nL(ΦC(ϕ(𝐳i1,ΦG2(𝐍2))),yi),𝐳i1=ϕ(𝐱i,ΦG1(𝐍1)).\mathop{\min}_{\theta_{G_{1}},\theta_{G_{2}},\theta_{C}}\frac{1}{n}\sum_{i=1}^{n}L\Big{(}\Phi_{C}\big{(}\phi\big{(}\mathbf{z}_{i}^{1},\Phi_{G_{2}}(\mathbf{N}_{2})\big{)}\big{)},y_{i}\Big{)},\ \mathbf{z}_{i}^{1}=\phi\big{(}\mathbf{x}_{i},\Phi_{G_{1}}(\mathbf{N}_{1})\big{)}. (12)
Refer to caption
(a) Loss curves
Refer to caption
(b) Accuracy curves
Figure 3: An example of the training process on a bi-classification task. A two-layer structure is adopted. Loss and accuracy variations are recorded. There are evident performance leaps, i.e., the drop of losses and the step-up of accuracies, by adding a second layer at the turning point of 200-th epoch. Detailed settings can be found in section 4.

Advantages of the multi-layer structure are obvious. The generator in the current layer actually models some distribution on features from the previous layer, indicating a good kernel on features. Besides, associated with the proposed progressively training strategy (introduced in section 3.3), during the whole training process, by updating the generators layer-by-layer, there are distinct improvements on both the loss curves and the accuracy curves (see Fig.3), implying that adding more layers leads to significant performance leaps. More details about the distributions on features are illustrated in section 4.

3.3 Progressively Training

For the multi-layer generative random Fourier features, since there exist several generative networks, it is hard to efficiently update all the parameters in multiple networks simultaneously, which possibly results in a total failure of convergence [21]. Therefore, inspired by the training in ProGAN [18], we propose a progressively training strategy to efficiently train the multiple generators in an inverse and layer-by-layer order.

The progressively training strategy contains several phases. The number of phases equals to the number of generators. In the first phase, we freeze the update of all the parameters except the last generator and the FC layer. After the training converges, we step into the next phase. Now the penultimate generator is unfrozen and gets updated together with the last one and the FC layer until convergence, while the others are still kept fixed. In such an inverse and layer-by-layer order, we progressively unfreeze the generators in each layer, add them to the training sequence one by one, and finally train all these generators together with the FC layer, which is empirically proved as an efficient training way for the multi-layer structure. This progressively training alogrithm is demonstrated in Alg.1.

In addition, more details of the generator are described. Every generator is parameterized as a multi-layer perceptron (MLP), which can be viewed as a composition of several cascaded blocks. Among these blocks, except for the last one, each block sequentially holds a linear layer, a batch normalization [22] layer, and an activation layer with leaky ReLU function. In the last block, we remove the batch normalization layer and replace leaky ReLU with tanh function as the activation. Fig.1 illustrates the described components inside the generator. Furthermore, one can customize the network according to the practical cases. For example, it is allowed to increase the number of blocks in the MLP to enhance its learning ability or to add dropout [23] to alleviate overfitting. The number of neurons in the last FC layer can be modified freely for multi-classification or regression tasks.

Algorithm 1 The progressively training for GRFFs.
0:  Training set XtrX_{tr}, number of generators KK, numbers of epochs epo1,epo2,,epoKepo_{1},epo_{2},\cdots,epo_{K} w.r.t. KK training phases.
0:  KK generators ΦG1,ΦG2,,ΦGK\Phi_{G_{1}},\Phi_{G_{2}},\cdots,\Phi_{G_{K}} and a linear classifier ΦC\Phi_{C}.
1:  while not reach i=1Kepoi\sum_{i=1}^{K}epo_{i} do
2:     Sample KK sets of noises from 0\mathbb{P}_{0}: 𝐍i,i=1,,K\mathbf{N}_{i},i=1,\cdots,K.
3:     Take a batch of samples (𝐱,y)(\mathbf{x},y) from XtrX_{tr}.
4:     Build features at the kk-th layer sequentially: 𝐙k=ϕ(𝐙k1,ΦGk(𝐍k)),𝐙0=𝐱,k=1,,K\mathbf{Z}^{k}=\phi(\mathbf{Z}^{k-1},\Phi_{G_{k}}(\mathbf{N}_{k})),\ \mathbf{Z}^{0}=\mathbf{x},\ k=1,\cdots,K.
5:     Determine the loss: loss=L(ΦC(𝐙K),y)loss=L(\Phi_{C}(\mathbf{Z}^{K}),y).
6:     for j=1,,Kj=1,\cdots,K do \eqparboxCOMMENTthe jj-th training phase
7:        if not reach i=1jepoi\sum_{i=1}^{j}epo_{i} then \eqparboxCOMMENTinverse order
8:           Update ΦGKj+1,,ΦGK\Phi_{G_{K-j+1}},\cdots,\Phi_{G_{K}} and ΦGC\Phi_{G_{C}} via loss\nabla loss by Adam.
9:           Break.
10:        end if
11:     end for
12:  end while

The standard normal distribution is set as the input noise distribution 0\mathbb{P}_{0}, which corresponds to a radial basis function (RBF) kernel, the most widely used universal kernel. At each iteration, we independently resample noises from 0\mathbb{P}_{0} to optimize the generators on the expectation of the distribution, and update simultaneously the parameters of the generators and the classifier by Adam [24] optimizer. We choose the cross entropy function as the loss function LL. One can estimate whether the model is trained until convergence by watching the variation of the cross entropy loss value, with which the final classification performance is closely connected.

For the inference process, given a new sample 𝐱new\mathbf{x}_{new}, random noises are first sampled from 0\mathbb{P}_{0}, then the weights are generated via the generators, then the corresponding generative random Fourier feature 𝐳new\mathbf{z}_{new} is constructed via the weights and 𝐱new\mathbf{x}_{new} by applying Eq.(7) or Eq.(10), Eq.(11). Finally, the linear classifier ΦC\Phi_{C} will predict the label of 𝐱new\mathbf{x}_{new}.

So far, we finish building an end-to-end, one-stage kernel learning method, which implicitly learns the kernel distribution by solving an ERM problem via the generative RFFs. To learn the latent kernel distribution, a generative network is designed to simulate the sampling process without an explicit definition of the probability density function of the distribution. Besides, jointly learning the features and the classifier allows us to increase the depth of features. The resulting multi-layer structure can learn a good kernel on features and thus even boosts the generalization performance. In addition, a progressively training strategy is proposed to efficiently train the multiple generative networks.

Remark We would like to reiterate the following facts. It’s worth noting that the initial motivation of RFFs-based methods is to reduce the time and space complexity in large-scale kernel machines. For instance, the vanilla RFFs [5] speed up the kernel machines a lot, then get even accelerated in [6]. Further, random features are then extended to kernel learning, since RFFs actually provide a good justification for kernel learning in the spectral cases. The researches in [6, 7] are two typical kernel learning algorithms via random features, see a recent survey [14] on the development of RFFs-based algorithms for kernel learning. Regarding to our method, the target is to perform kernel learning via RFFs, but NOT reducing the time-consuming. In fact, since the neural network requires training, our method and the method in [7] are inevitably inferior in the processing speed during training and inference, compared with those algorithms specifically designed for kernel acceleration [5, 6]. Instead, the purpose of our method is to learn a more well-generalized kernel based on the fact that the optimization objective in [6, 7], i.e., solving the target alignment, is not necessarily optimal in classification tasks. To achieve such a purpose, we introduce the generative networks and adopt an end-to-end scheme to reach a one-stage solution. The resulting two merits of this solution, better generalization and stronger robustness, will be empirically illustrated in the next section.

4 Experiments and Results

We evaluate the classification performance of the proposed GRFFs on a wide range of data sets. Following the work in [6, 7], we first test the performance on a synthetic data set. Second, we choose several benchmark data sets from UCI repository111https://archive.ics.uci.edu/ml/datasets.html and LIBSVM data222https://www.csie.ntu.edu.tw/˜cjlin/libsvmtools/datasets/, including both large-scale and small-scale sets, and test the performance on these data sets. Finally, we introduce a variant of the multi-layer GRFFs for image data and conduct an adversarial attack on this variant to illustrate its adversarial robustness.

The simplest version of GRFFs only includes a single layer, i.e., one single generative network, denoted as SL-GRFF. We denote the multi-layer GRFFs including more than one generators as ML-GRFF. We compare SL-GRFF and ML-GRFF with the following approaches.

  • RFF. Rahimi and Recht [5] first proposed the vanilla RFFs to approximate the kernel function by sampling from its spectral distribution. This approach is data-independent.

  • OPT-RFF. Based on the vanilla RFFs, Sinha and Duchi [6] proposed to optimize the random features by solving a target alignment problem. A feature subset of an optimal size is learned, which shows great superiority in high processing speed in large-scale kernel machines.

  • IKL. Li et al. [7] adopted a neural network to model the spectral distribution of the kernel function. The network is trained by optimizing a target alignment problem. A linear classifier is applied on the features.

  • MLP. The traditional MLP with ReLU activation function is also included in the comparison. Specifically, the numbers of layers in the MLP equal to that in the ML-GRFF, and the numbers of neurons in each hidden layer in the MLP equal to the numbers of sampled noises in each layer in the ML-GRFF.

In all the experiments, for RFF and OPT-RFF, the random features are generated by sampling weights from a normal distribution, which makes the features approximate an RBF kernel. Regarding to the number of sampled weights, i.e., the value of DD in Eq.(5), following the implement333https://github.com/duchi-lab/learning-kernels of OPT-RFF [6], a large enough number of weights are initially sampled in OPT-RFF to learn a feature subset of an optimal size DoptD_{opt}. Then this learned DoptD_{opt} is employed to build the vanilla RFFs [5]. The ridge regression is set as the linear classifier in the second stage of RFF and OPT-RFF. We only compare with the results of IKL claimed in the paper [7] on the synthetic data set in section 4.1.

For our proposed GRFFs, in section 4.1 and 4.2, for the multi-layer version, we adopt a two-layer structure including two generators, each of which is parameterized as an MLP. We set D1=256D_{1}=256 and D2=64D_{2}=64. The structures of the first and second generators are 1001286464dim100\to 128\to 64\to 64\to dim and 100512256256512100\to 512\to 256\to 256\to 512 respectively, where dimdim denotes the input data dimension. The FC layer contains 2D2=1282D_{2}=128 neurons. For the single-layer GRFFs in section 4.1, the generator is also parameterized as an MLP and holds a structure of 1001286464dim100\to 128\to 64\to 64\to dim. We set D=256D=256, thus the following FC layer contains 2D=5122D=512 neurons. We discuss GRFFs with more than 2 generators in A.

4.1 Performance on Synthetic Data

For the synthetic set, we generate {𝐱i}i=1n𝒩(0,Id)\{\mathbf{x}_{i}\}_{i=1}^{n}\sim\mathcal{N}(0,I_{d}) with yi=sign(𝐱i2d)y_{i}=\mathrm{sign}(\|\mathbf{x}_{i}\|^{2}-\sqrt{d}), where dd is data dimension. The training set includes 104{10}^{4} samples and the test set contains 103{10}^{3} samples. Fig.4a shows the data distribution when d=2d=2. An RBF kernel is ill-suited for this data set since the Euclidean distance used in this kernel fails to correctly capture the data pattern near the boundary [6]. One may need to carefully tune the kernel width to alleviate the inherent inappropriateness between the RBF kernel and the data set.

Refer to caption
(a) Synthetic data when d=2d=2
Refer to caption
(b) Test errors vs. dimension
Refer to caption
(c) Errors vs. dimension
Figure 4: Performance on the synthetic data. (a) Data distribution when the dimension equals to 2. (b) Misclassification errors on the test set of different methods. (c) Misclassification errors on the training and test sets of SL-GRFF and ML-GRFF.
Refer to caption
(a) PCA on SL-GRFF, d=8d=8.
Refer to caption
(b) PCA on ML-GRFF in the 1st layer, d=8d=8.
Refer to caption
(c) PCA on ML-GRFF in the 2nd layer, d=8d=8.
Refer to caption
(d) PCA on SL-GRFF, d=24d=24.
Refer to caption
(e) PCA on ML-GRFF in the 1st layer, d=24d=24.
Refer to caption
(f) PCA on ML-GRFF in the 2nd layer, d=24d=24.
Figure 5: PCA visualizations of features of SL-GRFF and ML-GRFF, corresponding to synthetic data when d=8d=8 (top) and d=24d=24 (bottom) respectively. The blue diamond points denote positive samples, while the red pentagram points denote negative samples.
Refer to caption
(a) Loss curves when d=2d=2
Refer to caption
(b) Accuracy curves when d=2d=2
Refer to caption
(c) Loss curves when d=6d=6
Refer to caption
(d) Accuracy curves when d=6d=6
Refer to caption
(e) Loss curves when d=8d=8
Refer to caption
(f) Accuracy curves when d=8d=8
Figure 6: Average loss and accuracy at each epoch on the synthetic data of different dimensions. The green curves denote records on the test set, while the magenta curves denote records on the training set. The turning point of the 1st and 2nd stages is the 200-th epoch.

Fig.4b illustrates the test errors of different methods corresponding to different data dimensions d{2,4,6,8,,18,20}d\in\{2,4,6,8,...,18,20\}. Both RFF and OPT-RFF manifest a sharp performance decrease along with the increase of dd. By contrast, IKL and ML-GRFF both have rather stable performance as the dimension dd increases, since they try to learn the kernel distribution via a neural network without any prior kernel function definition. Particularly, by directly optimizing the ERM problem in an end-to-end manner, ML-GRFF characterizes the data pattern much better than the others and achieves the lowest test errors. Fig.4c shows the comparison results on the training and test sets of SL-GRFF and ML-GRFF respectively. In a wide range of data dimensions, ML-GRFF always has an evident superiority of classification errors on the test set over SL-GRFF.

Besides, to visualize the generative RFFs, we take PCA [25] to extract the top-3 principal components of random features in SL-GRFF and different layers of ML-GRFF, shown in Fig.5. SL-GRFF and ML-GRFF are trained on the synthetic data of two dimensions, d=8d=8 and d=24d=24, respectively. When d=8d=8, both SL-GRFF and ML-GRFF have similar performance, while when d=24d=24, ML-GRFF performs better than SL-GRFF. For ML-GRFF, the extracted principal components in the 1st layer (see Fig.5b and Fig.5e) do not show any evident linear separability. But, the components in the 2nd layer (see Fig.5c and Fig.5f) follow a clear linear separable distribution. For SL-GRFF, we can also see the linear separability of the extracted components from Fig.5a and Fig.5d, which is obviously weaker than that of the extracted components of the 2nd layer in ML-GRFF. Hence, by going deeper and deeper, ML-GRFF is able to learn a good kernel on features, which results in better performance.

To illustrate the traits of the progressively training strategy, we record the loss and accuracy curves during training on the synthetic data of different dimensions in Fig.6. When the data is easy to be separated, i.e., it is low-dimensional (d=2d=2), the whole training process does not show much differences between the two training phases in Fig.6a and Fig.6b. As the dimension increases, there are obvious improvements, i.e., the drop of loss and the step-up of accuracy, from the 1st training phase to the 2nd training phase in Fig.6c \sim Fig.6f, at the turning point of the 200-th epoch. By progressively training the generators layer by layer, the convergence is achieved and there is good performance.

4.2 Performance on Benchmark Data Sets

The models are trained on 77 small-scale data sets and 33 large-scale data sets. We randomly pick half of the data as the training set, the other half as the test set, except for the data sets of which the training and test sets have already been divided. After normalizing the data to [0,1]d{[0,1]}^{d} in advance by min-max normalization, 20%20\% of the training data are randomly selected as the validation set. For RFF and OPT-RFF, the validation set is adopted to perform validations on a group of 10 γ\gamma-s ranging from 0.50.5 to 1.41.4 with an interval of 0.10.1, where γ\gamma is the steep hyper-parameter in the RBF kernel: k(𝐱,𝐱)=exp(γ𝐱𝐱2)k(\mathbf{x},\mathbf{x}^{\prime})=\exp(-\gamma||\mathbf{x}-\mathbf{x}^{\prime}||^{2}). For MLP and ML-GRFF, during training, models that achieve best performance on the partitioned validation set are selected. All the experiments on every data set are repeated 5 times.

Table 1: Training and test accuracy (%) on small-scale data sets of different models. The highest test accuracy is highlighted in bold.
method data set monks1 monks2 monks3 australia climate diabetic sonar
(#tr;#te;dd) (124;432;6) (169;432;6) (122;432;6) (345;345;14) (270;270;18) (576;575;19) (104;104;60)
RFF train 95.48±3.1595.48\pm 3.15 98.82±1.7398.82\pm 1.73 97.38±0.6997.38\pm 0.69 93.51±2.5793.51\pm 2.57 99.93±0.1799.93\pm 0.17 79.90±2.0079.90\pm 2.00 100.00±0.01100.00\pm 0.01
test 81.48±1.8381.48\pm 1.83 78.66±1.5678.66\pm 1.56 93.61±0.4593.61\pm 0.45 84.70±1.5384.70\pm 1.53 91.48±1.7291.48\pm 1.72 73.37±1.39\mathbf{73.37\pm 1.39} 77.88±4.9077.88\pm 4.90
OPT-RFF train 100.00±0.01100.00\pm 0.01 91.83±3.2891.83\pm 3.28 96.23±1.6096.23\pm 1.60 93.45±2.2993.45\pm 2.29 99.70±0.3199.70\pm 0.31 78.78±2.0878.78\pm 2.08 100.00±0.01100.00\pm 0.01
test 91.67±4.2991.67\pm 4.29 77.45±2.7877.45\pm 2.78 92.13±1.9692.13\pm 1.96 83.83±1.8583.83\pm 1.85 92.44±1.4292.44\pm 1.42 72.40±1.2372.40\pm 1.23 78.46±7.0278.46\pm 7.02
MLP train 96.94±0.3296.94\pm 0.32 96.80±1.5796.80\pm 1.57 97.05±0.4097.05\pm 0.40 92.93±2.2792.93\pm 2.27 98.37±0.8398.37\pm 0.83 84.27±3.9484.27\pm 3.94 91.35±5.0991.35\pm 5.09
test 84.31±1.5084.31\pm 1.50 84.86±2.98\mathbf{84.86\pm 2.98} 85.37±1.4685.37\pm 1.46 84.64±1.5484.64\pm 1.54 93.26±1.1393.26\pm 1.13 67.97±1.3667.97\pm 1.36 75.77±2.6875.77\pm 2.68
ML-GRFF train 97.90±1.5097.90\pm 1.50 82.72±1.9682.72\pm 1.96 95.08±1.5695.08\pm 1.56 85.86±1.9385.86\pm 1.93 95.04±1.5895.04\pm 1.58 69.53±2.0169.53\pm 2.01 80.38±7.0880.38\pm 7.08
test 95.83±2.15\mathbf{95.83\pm 2.15} 79.72±1.1879.72\pm 1.18 93.70±0.99\mathbf{93.70\pm 0.99} 86.09±1.79\mathbf{86.09\pm 1.79} 94.59±1.42\mathbf{94.59\pm 1.42} 70.76±1.0370.76\pm 1.03 80.58±7.02\mathbf{80.58\pm 7.02}

Tab.1 and Tab.2 illustrate the training and test accuracy of different methods on the small-scale and large-scale data sets respectively. The sizes of the training and test sets and the data dimensions are also listed in the two tables. One can find that in most cases, ML-GRFF outperforms RFF and OPT-RFF a lot and achieves competitive (sometimes even slightly better) results compared with MLP.

Table 2: Training and test accuracy (%) on large-scale data sets of different models. The highest test accuracy is highlighted in bold.
method data set adult ijcnn phishing
(#tr; #te; dd) (32,561; 16,281; 123) (49,990; 91,701; 22) (5,528; 5,527; 68)
RFF train 84.50±0.2184.50\pm 0.21 95.35±0.1995.35\pm 0.19 94.53±0.3594.53\pm 0.35
test 84.25±0.1684.25\pm 0.16 93.18±0.1993.18\pm 0.19 92.63±0.6192.63\pm 0.61
OPT-RFF train 84.56±0.2984.56\pm 0.29 94.85±0.2894.85\pm 0.28 95.50±0.1495.50\pm 0.14
test 84.47±0.2884.47\pm 0.28 93.20±0.0593.20\pm 0.05 94.26±0.2294.26\pm 0.22
MLP train 85.05±0.0785.05\pm 0.07 98.58±0.4898.58\pm 0.48 97.76±0.5097.76\pm 0.50
test 84.70±0.1984.70\pm 0.19 98.54±0.23\mathbf{98.54\pm 0.23} 95.69±0.23\mathbf{95.69\pm 0.23}
ML-GRFF train 84.93±0.1484.93\pm 0.14 94.03±0.4294.03\pm 0.42 95.23±0.3295.23\pm 0.32
test 84.76±0.17\mathbf{84.76\pm 0.17} 94.60±0.5794.60\pm 0.57 95.20±0.3895.20\pm 0.38

4.3 Performance on Image Data and Adversarial Robustness

As aforementioned, the promising performance on generalization of the proposed one-stage, end-to-end kernel learning method is empirically validated on both synthetic and real-world data. In the following, we conduct another experiment on adversarial robustness to demonstrate the superiority of our GRFF.

4.3.1 Background and Motivation

The robustness of DNNs is demonstrated to be vulnerable and sensitive to imperceptible perturbations, a.k.a. adversarial examples [26, 27], created by white-box adversarial attackers, which need availability to the complete information of DNNs, e.g., model parameters [27, 28]. Hence, a line of adversarial defenses are based on an observation that adversarial examples created on one model by these attacks might totally lose efficacy on another one of different parameters [29], which inspires us to investigate the adversarial robustness of GRFF.

In GRFF, the independent samplings from the generators yield non-deterministic weights from the learned kernel distributions. In other words, every time the independent sampling produces a GRFF model of different parameters. Intuitively, such non-deterministic weights of GRFF raise the possibility of circumventing adversarial attacks on fixed parameters, thus bringing stronger adversarial robustness. In the remainder of this subsection, we first describe a variant of GRFF for image data, then evaluate its adversarial robustness against an iterative, gradient-based adversarial attack [28].

4.3.2 Variant of GRFF for Image Data

When dealing with image data, an image of C×H×WC\times H\times W size, where CC,HH,WW denote the number of channels, height and width respectively, can be simply stretched to a 1-dimensional vector. Then multiple 1-dimensional weight vectors of C×H×WC\times H\times W size could be generated to build the corresponding GRFFs of the image vector. However, it is impractical to directly include so many C×H×WC\times H\times W neurons in the output layer of the generator, resulting in extremely numerous parameters and heavy computational burden. Therefore, we design a variant of GRFF to deal with image data. In the following, we elaborate two main changes of this variant in detail: how to generate weights and how to build random features.

To avoid posing C×H×WC\times H\times W neurons at the output layer, the generators are now devised to generate small-size, 2-dimensional weights, which resemble the common convolution kernels and could efficiently reduce computational costs. Accordingly, the generator structure should be modified. Specifically, the generators are parameterized the same as those in DCGAN [17], instead of MLPs. Compared with the cascaded blocks in MLPs described in section 3.3 before, all the linear layers in the blocks are now substituted by transposed convolution [30] layers to generate 2-dimensional weights, and the leaky ReLUs are replaced by ReLUs. The others still remain the same.

The way to build random features for image data still basically follows the feature map ϕ\phi of Eq.(5). The inner product 𝐰T𝐱\mathbf{w}^{T}\mathbf{x} between weights 𝐰\mathbf{w} and vectors 𝐱\mathbf{x} is now replaced by the convolution on image via the generated 2-dimensional weights. The convoluted images then pass through cosine and sine functions respectively as indicated in Eq.(5), and the resulting two pieces are concatenated along the image channels, followed by a pooling operation. These manipulations could also be sequentially executed on the output of a preceding layer to build random features at the current layer, thus forming a multi-layer structure. After such multi-layer operations, i.e., convolution, cosine and sine activation, concatenation and pooling, the last random features are stretched to a 1-dimensional vector and an FC layer will make the predictions.

4.3.3 Performance against Adversarial Attacks

Omitting the generators in the variant of GRFF described above, the model itself will reduce to a Convolution Neural Network (CNN) with cosine and sine activation functions. The generators actually learn the convolution kernels in every layer of the CNN. During inference, by resampling from the learned distributions for many times, the generators will keep generating non-deterministic and independent convolution kernels, which every time will lead to a specific CNN. Hence, inspired by the randomized resampling mechanism associated with the distribution learning ability of GRFF, we investigate its application in defending the adversarial attacks.

In this subsection, we mainly exploit the robustness of GRFF against an iterative, gradient-based attack, called Iteratively, Least-Likely (Iter.L.L.) attack [28], which is developed from a one-step, gradient-based attack, Fast Gradient Sign Method (FGSM, [27]). By taking the gradient on input images and moving the images a small step along the direction of the sign of the gradient, FGSM creates the corresponding adversarial examples. Based on this one-step FGSM, to acquire better attack performance, Kurakin et al. [28] proposed the iterative version, i.e., Iter.L.L. attack. The steps for creating the adversarial example 𝐱𝐚𝐝𝐯\mathbf{x_{adv}} of an original example 𝐱\mathbf{x} can be summarized as follows,

𝐱𝐚𝐝𝐯(i+1)=Clip𝐱,ϵ{𝐱𝐚𝐝𝐯(i)},𝐱𝐚𝐝𝐯(i)𝐱𝐚𝐝𝐯(i)αsign(𝐱L(model(𝐱𝐚𝐝𝐯(i)),yll)),𝐱𝐚𝐝𝐯(0)=𝐱.\begin{split}\mathbf{x_{adv}}^{(i+1)}&=\mathrm{Clip}_{\mathbf{x},\epsilon}\{\mathbf{x_{adv}}^{(i)}\},\\ \mathbf{x_{adv}}^{(i)}&\leftarrow\mathbf{x_{adv}}^{(i)}-\alpha\cdot\mathrm{sign}(\nabla_{\mathbf{x}}L(\mathrm{model}(\mathbf{x_{adv}}^{(i)}),y_{ll})),\\ \mathbf{x_{adv}}^{(0)}&=\mathbf{x}.\end{split} (13)

At the ii-th iteration, Iter.L.L. attack computes the loss between the predicted label and the least-likely label ylly_{ll}, moves the image a small step α\alpha along the direction of the sign of the gradient of the input, and clips the pixel value of the perturbed image to a specified range [𝐱i,jϵ,𝐱i,j+ϵ][\mathbf{x}_{i,j}-\epsilon,\mathbf{x}_{i,j}+\epsilon]. The step size is set as α=1\alpha=1 by default. The number of iterations is set as min{ϵ+4,1.25ϵ}\min\{\epsilon+4,1.25\epsilon\}.

Iter.L.L. attack aims at linearizing the loss function around given fixed model parameters. While the weights in GRFF are generated independently and thus are non-deterministic. Hence, we evaluate the ability of GRFF in defending Iter.L.L. attack on MNIST [31] and have the following experiment design.

  1. 1.

    Given an image 𝐱\mathbf{x} and noises 𝐍(1)\mathbf{N}_{(1)} sampled from 0\mathbb{P}_{0}, GRFF can predict its label y^(0)\hat{y}_{(0)} and compute the loss between y^(0)\hat{y}_{(0)} and the least-likely label.

  2. 2.

    By back propagation, the gradients on 𝐱\mathbf{x} are determined. Then the corresponding adversarial example 𝐱𝐚𝐝𝐯\mathbf{x_{adv}} is created according to Eq.(13).

  3. 3.

    GRFF can again take the original noises 𝐍(1)\mathbf{N}_{(1)} as input and output a prediction y^(1)\hat{y}_{(1)} of 𝐱𝐚𝐝𝐯\mathbf{x_{adv}}.

  4. 4.

    By resampling independent noises 𝐍(2)\mathbf{N}_{(2)} from 0\mathbb{P}_{0}, now GRFF can take the new noises 𝐍(2)\mathbf{N}_{(2)} as input and output another prediction y^(2)\hat{y}_{(2)} on 𝐱𝐚𝐝𝐯\mathbf{x_{adv}}.

Based on the predicted results y^(0)\hat{y}_{(0)}, y^(1)\hat{y}_{(1)} and y^(2)\hat{y}_{(2)} above, we could determine the corresponding unattacked accuracy acc(0)acc_{(0)}, attacked accuracy with original noises acc(1)acc_{(1)}, and attacked accuracy with resampled noises acc(2)acc_{(2)}. It is expected that the attacked accuracies acc(1)acc_{(1)} and acc(2)acc_{(2)} should be smaller than acc(0)acc_{(0)}, and that acc(2)acc_{(2)} should be larger than acc(1)acc_{(1)} since resampling should be helpful in defending the Iter.L.L. attack.

In this experiment, for the variant of GRFF, we also adopt a two-layer structure, the generators of which generate 1616 and 88 weights of 5×55\times 5 size in the 1st and 2nd layers respectively. The max-pooling operator takes a perception field of 2×22\times 2 size. The experiment results are shown in Fig.7. By omitting the generators, we acquire a CNN with cosine and sine activation functions. We also record the results of the Iter.L.L. attack on such a CNN in Fig.7.

Refer to caption
Figure 7: Results of Iter.L.L. attack on GRFF and CNN. The blue solid curve and the magenta dotted curve denote the variations of acc(2)acc_{(2)} and acc(1)acc_{(1)} respectively. While acc(0)acc_{(0)} refers to the accuracy at ϵ=0\epsilon=0. The red dash-dot curve denotes the variation of the attacked accuracy of the CNN with cosine and sine activation functions.
Refer to caption
Figure 8: An illustration of some adversarial examples of Iter.L.L attack. Different ϵ\epsilon values correspond to the different rows. Among the 3 numbers at the top of each adversarial example, the first number denotes the true label, and the middle one denotes the prediction y^(1)\hat{y}_{(1)}, and the last one denotes the prediction y^(2)\hat{y}_{(2)}. These adversarial examples (except the first row) successfully fool GRFF without resampling. But GRFF successfully identifies these adversarial examples by resampling.

One can find that under Iter.L.L. attack, by resampling non-deterministic weights, acc(2)acc_{(2)} are generally larger than acc(1)acc_{(1)}, indicating stronger adversarial robustness brought by the resampling mechanism of GRFF. When ϵ\epsilon equals to 1212 with 1515 iterations, all the adversarial examples successfully fool CNN and GRFF with original noises and result in zero accuracy. While by resampling independent weights, GRFF successfully identifies nearly 40%40\% of these adversarial examples, which is a promising improvement. Besides, both the attacked accuracies with resampled noises and original noises are higher than the attacked accuracy of CNN. Therefore, due to the randomized resampling mechanism associated with GRFF, our model is able to alleviate the performance decrease brought by the Iter.L.L. attack. It also shows superiority over the common CNN of the same structure in defending the Iter.L.L. attack. Part of the adversarial examples are shown in Fig.8.

5 Discussion and Conclusion

In conclusion, we proposed a one-stage kernel learning approach, which models some latent distribution of the kernel via a generative network based on the random Fourier features. Not like the existing methods that learn the distribution by the target alignment and then train a linear classifier, we directly solved the ERM problem to jointly learn the features and the classifier for better generalization performance. Further, such an end-to-end manner enables the model itself to extend to deeper layers, which leads to a multi-layer structure and implies a good kernel on features. Besides, a progressively training strategy was proposed to efficiently train the multiple generators in an inverse, layer-by-layer order. Empirical results illustrate the superiority of ML-GRFF in classification tasks over the classical two-stage, RFFs-based methods. Meanwhile, GRFF enables us to resample independent and non-deterministic parameters from the generators. Such parameter randomness is helpful in defending adversarial attacks and is empirically verified by the enhanced robustness against the adversarial examples.

One of the limitations of the proposed method is that its performance on large-scale image data and deeper networks is still restricted. In our method, for image data, the generators try to learn the distributions on 3×33\times 3 or 5×55\times 5 weights. However, in deeper network, like ResNet [32], a usual size of kernels in the convolution layer is 512×512×3×3512\times 512\times 3\times 3. The generator cannot include 512×512×3×3512\times 512\times 3\times 3 neurons in its output layer since it is a heavy burden for GPUs to compute the gradients. Besides, more layers indicate more generators. Training tens of, or even hundreds of generators still remains a difficult problem.

Since the parameter distribution is modeled by the generative networks, which cost huge memory and computation resources, future work will focus on how to design cheap and efficient mechanism to model the parameter randomness and to achieve better robustness. Besides, based on the existing researches on the convergence of GANs [33, 34], a theoretical guarantee on the convergence of the GRFF will be developed in the future.

References

  • Scholkopf and Smola [2001] Bernhard Scholkopf and Alexander J Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2001.
  • Filippone et al. [2008] Maurizio Filippone, Francesco Camastra, Francesco Masulli, and Stefano Rovetta. A survey of kernel and spectral methods for clustering. Pattern Recognition, 41(1):176–190, 2008.
  • Nazarpour and Adibi [2015] Abdollah Nazarpour and Peyman Adibi. Two-stage multiple kernel learning for supervised dimensionality reduction. Pattern Recognition, 48(5):1854–1862, 2015.
  • Lauriola et al. [2020] Ivano Lauriola, Claudio Gallicchio, and Fabio Aiolli. Enhancing deep neural networks via multiple kernel learning. Pattern Recognition, 101:107194, 2020.
  • Rahimi and Recht [2008] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances In Neural Information Processing Systems, pages 1177–1184, 2008.
  • Sinha and Duchi [2016] Aman Sinha and John C Duchi. Learning kernels with random features. In Advances In Neural Information Processing Systems, pages 1298–1306, 2016.
  • Li et al. [2019] Chun-Liang Li, Wei-Cheng Chang, Youssef Mroueh, Yiming Yang, and Barnabas Poczos. Implicit kernel learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2007–2016, 2019.
  • Cristianini et al. [2002] Nello Cristianini, John Shawe-Taylor, Andre Elisseeff, and Jaz S Kandola. On kernel-target alignment. In Advances In Neural Information Processing Systems, pages 367–373, 2002.
  • Rudin [1962] Walter Rudin. Fourier analysis on groups, volume 121967. Wiley Online Library, 1962.
  • Mohri et al. [2018] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2018.
  • Liu et al. [2020] Fanghui Liu, Xiaolin Huang, Yudong Chen, Jie Yang, and Johan Suykens. Random fourier features via fast surrogate leverage weighted sampling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 4844–4851, 2020.
  • Zhang et al. [2019] Wenyu Zhang, Zhenjiang Zhang, Lifu Wang, Han-Chieh Chao, and Zhangbing Zhou. Extreme learning machines with expectation kernels. Pattern Recognition, 96:106960, 2019.
  • Belkin et al. [2019] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
  • Liu et al. [2021] Fanghui Liu, Xiaolin Huang, Yudong Chen, and Johan A. K. Suykens. Random features for kernel approximation: A survey on algorithms, theory, and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2021. doi: 10.1109/TPAMI.2021.3097011.
  • Goodfellow et al. [2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances In Neural Information Processing Systems, pages 2672–2680, 2014.
  • Radford et al. [2016] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference on Learning Representations, 2016.
  • Karras et al. [2018] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.
  • Ukai et al. [2018] Kenya Ukai, Takashi Matsubara, and Kuniaki Uehara. Hypernetwork-based implicit posterior estimation and model averaging of cnn. In Asian Conference on Machine Learning, pages 176–191, 2018.
  • Ratzlaff and Fuxin [2019] Neale Ratzlaff and Li Fuxin. Hypergan: A generative model for diverse, performant neural networks. In International Conference on Machine Learning, pages 5361–5369, 2019.
  • Kawaguchi and Huang [2019] Kenji Kawaguchi and Jiaoyang Huang. Gradient descent finds global minima for generalizable deep neural networks of practical sizes. In 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 92–99. IEEE, 2019.
  • Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
  • Srivastava et al. [2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • Kingma and Ba [2015] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  • Wold et al. [1987] Svante Wold, Kim Esbensen, and Paul Geladi. Principal component analysis. Chemometrics and intelligent laboratory systems, 2(1-3):37–52, 1987.
  • Szegedy et al. [2014] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
  • Goodfellow et al. [2015] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
  • Kurakin et al. [2017] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. In International Conference on Learning Representations, 2017.
  • Fang et al. [2020] Kun Fang, Qinghua Tao, Yingwen Wu, Tao Li, Jia Cai, Feipeng Cai, Xiaolin Huang, and Jie Yang. Towards robust neural networks via orthogonal diversity. arXiv preprint arXiv:2010.12190, 2020.
  • Zeiler et al. [2010] Matthew D Zeiler, Dilip Krishnan, Graham W Taylor, and Rob Fergus. Deconvolutional networks. In 2010 IEEE Computer Society Conference on computer vision and pattern recognition, pages 2528–2535. IEEE, 2010.
  • LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Liu et al. [2017] Shuang Liu, Olivier Bousquet, and Kamalika Chaudhuri. Approximation and convergence properties of generative adversarial learning. In Advances In Neural Information Processing Systems, pages 5545–5553, 2017.
  • Farnia and Ozdaglar [2020] Farzan Farnia and Asuman Ozdaglar. Do GANs always have Nash equilibria? In International Conference on Machine Learning, pages 3029–3039. PMLR, 2020.

Appendix A More discussions on ML-GRFF

In this section, discussions on ML-GRFF including over 2 layers are further raised. Specifically, experiments on synthetic data are designed and executed to illustrate the feasibility of the training algorithm Alg.1 even in the case of more than 2 layers. We show that a two-layer structure is an empirically-optimal choice, simultaneously considering the improved generalization performance and the difficulty of tuning hyper-parameters. The designed experiments and results are outlined below.

Settings:

The experiments are executed on synthetic data, where (𝐱,y)(\mathbf{x},y) follows 𝐱𝒩(0,Id)\mathbf{x}\sim\mathcal{N}(0,I_{d}) of dimension d=80d=80 and d=100d=100 with y=sign(𝐱2d)y=\mathrm{sign}(||\mathbf{x}||^{2}-\sqrt{d}). The training set includes 10410^{4} samples and the test set includes 10310^{3} samples. A validation set containing 20% of the training data is partitioned to evaluate the model performance during training.

For ML-GRFF with KK layers, we train multiple models w.r.t. K{1,2,3,4}K\in\big{\{}1,2,3,4\big{\}}. Important hyper-parameters, e.g., numbers of sampled noises {D1,D2,,DK}\big{\{}D_{1},D_{2},\cdots,D_{K}\big{\}} and numbers of epochs {epo1,epo2,,epoK}\big{\{}epo_{1},epo_{2},\cdots,epo_{K}\big{\}}, are listed in Table 3. The experiments are repeated 5 times and the average results are recorded in Fig.9.

Table 3: Hyper-parameters of training ML-GRFF with KK layers.
# layers KK {D1,D2,,DK}\big{\{}D_{1},D_{2},\cdots,D_{K}\big{\}} {epo1,epo2,,epoK}\big{\{}epo_{1},epo_{2},\cdots,epo_{K}\big{\}}
11 {256}\big{\{}256\big{\}} {1000}\big{\{}1000\big{\}}
22 {256,64}\big{\{}256,64\big{\}} {200,1000}\big{\{}200,1000\big{\}}
33 {64,64,64}\big{\{}64,64,64\big{\}} {200,200,1000}\big{\{}200,200,1000\big{\}}
44 {64,64,64,64}\big{\{}64,64,64,64\big{\}} {200,200,200,1000}\big{\{}200,200,200,1000\big{\}}
Refer to caption
Figure 9: Results on the test sets of synthetic data of ML-GRFF including KK layers, K{1,2,3,4}K\in\big{\{}1,2,3,4\big{\}}. The synthetic data is of 2 different dimensions, d=80d=80 and d=100d=100, denoted by distinct colors. The solid and dotted lines indicate the results of RFF and OPT-RFF respectively.

Results and discussions:

As indicated in Fig.9, as the number of layers increases over 2, the progressively training strategy could still guarantee the convergence of ML-GRFF and produce better results than RFF, OPT-RFF and single-layer GRFF. On the other hand, the numbers of hyper-parameters of ML-GRFF also increase along with that of layers. It remains a difficult problem to efficiently tune these hyper-parameters to always achieve significant generalization performance improvements in the case of more and more layers. Based on the results in Fig.9, empirically, a two-layer structure is an appropriate choice, and thus we focus on ML-GRFF of two layers in all the experiments.