
Modularizing Deep Learning via Pairwise Learning With Kernels

Shiyu Duan, Shujian Yu, and José C. Príncipe. SD ([email protected]) and JCP ([email protected]) are with the Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611. SY ([email protected]) is with NEC Labs Europe, 69115 Heidelberg, Germany. Code available at: https://github.com/michaelshiyu/kerNET.git.
Abstract

By redefining the conventional notions of layers, we present an alternative view on finitely wide, fully trainable deep neural networks as stacked linear models in feature spaces, leading to a kernel machine interpretation. Based on this construction, we then propose a provably optimal modular learning framework for classification that does not require between-module backpropagation. This modular approach brings new insights into the label requirement of deep learning: It leverages only implicit pairwise labels (weak supervision) when learning the hidden modules. When training the output module, on the other hand, it requires full supervision but achieves high label efficiency, needing as few as 10 randomly selected labeled examples (one from each class) to achieve 94.88% accuracy on CIFAR-10 using a ResNet-18 backbone. Moreover, modular training enables fully modularized deep learning workflows, which then simplify the design and implementation of pipelines and improve the maintainability and reusability of models. To showcase the advantages of such a modularized workflow, we describe a simple yet reliable method for estimating the reusability of pre-trained modules as well as task transferability in a transfer learning setting. At practically no computational overhead, it precisely described the task space structure of 15 binary classification tasks from CIFAR-10.

Index Terms:
kernel methods, deep learning, neural networks, modular training, task transferability

I Introduction

Understanding the connections between neural networks (NNs) and kernel methods has been a long-standing goal in machine learning research [1]. Recently, there has been a resurgence of interest in this direction, leading to important insights and powerful algorithms [2, 3, 4, 5, 6, 7, 8]. The established connections require highly nontrivial assumptions: The equivalence between a particular kernel method and a family of NNs only exists for infinitely wide NNs and/or in expectation over random NNs (which are only weakly-trained [7]). Moreover, existing algorithms using kernels inspired by NNs typically have prohibitively high computational complexity (super-quadratic in sample size, to be exact [6]).

In this paper, we propose a simple method to link fully trainable, finitely wide NNs to kernel machines (KMs), i.e., linear models in feature spaces (reproducing kernel Hilbert spaces (RKHSs)) (Fig. 1). Specifically, as opposed to the existing literature, where the nonlinearity is considered the last component of an NN layer or module (composition of layers), we consider it to be the first component of the next layer. Layers in a deep NN can then be identified as linear models in feature spaces connected by potentially nonlinear feature maps. These linear models can be proven to be KMs with kernels induced by the connecting feature maps. The benefits of our construction are three-fold: First, the connections are established for fully trainable and finitely wide NNs, and are exact in the sense that the trainable parameters of the NN layers are exactly those of the corresponding KMs. Second, our method works with any NN layer, fully-connected or convolutional, with minimal adjustments. Last but not least, these NN-equivalent KMs run in linear time, contrasting the super-quadratic runtime of existing algorithms.

We then turn to another important yet understudied problem: How can we modularize deep learning (DL)? In engineering, and software engineering in particular, modularization is the core of any scalable workflow. Dividing the design into modules, optimizing them individually, and then wiring them together simplifies implementation by enabling unit tests, and enhances maintainability and reusability. There is currently no reliable way to completely modularize a deep NN, however, since there is no modular learning approach that provably matches the performance of end-to-end training. As a result, one is forced to design and optimize the entire model as a whole, configuring hundreds of hyperparameters simultaneously. When performance is unsatisfactory, it is practically impossible to trace the source of the problem to a particular design choice. Even if one succeeds in training a good model, reusing that model for a new task is highly nontrivial: Which part of the model is more reusable for what task?

To enable fully modularized DL, we develop a provably optimal modular training framework for deep NNs in classification that trains each module separately, yet still finds the overall loss function minimizer as an end-to-end approach would. Focusing on the two-module case, we prove that the training of input and output modules can be decoupled by leveraging pairwise kernel evaluations on training examples from distinct classes. This kernel is defined by the output module’s nonlinearity. It suffices to have the input module optimize a proxy objective function that does not involve the trainable parameters of the output one. This removes the need for error backpropagation between modules. On MNIST and CIFAR-10, our modular approach compares favorably against end-to-end training in terms of accuracy.

Our modular learning utilizes labels more efficiently than the existing end-to-end training paradigm. Specifically, training of the input module involves only pairs of examples from distinct classes with no need for knowing the actual classes. This is a weaker form of supervision than knowing exactly which class each example belongs to, as required by backpropagation. And we empirically show that the output module, which requires full supervision in training, is highly label-efficient, achieving 94.88% accuracy on CIFAR-10 with 10 randomly selected labeled examples (one from each class) using a ResNet-18 [9] backbone (94.93% when using all 50000 labels). Overall, our modular training can leverage a weaker form of supervision yet still produce models as performant as those obtained from end-to-end backpropagation, indicating that the existing form of supervision and backpropagation is not efficient enough. This can potentially enable less costly procedures for acquiring labeled data and more powerful un/semi-supervised learning algorithms.

To showcase one of the main benefits of modularization — module reuse with confidence — we demonstrate that one can easily and reliably quantify the reusability of a pre-trained module on a new target task with our proxy objective function. This provides a fast yet effective solution to an important practical issue in transfer learning. Moreover, this method can be extended to measure task transferability, a central problem in transfer learning, continual/lifelong learning, and multi-task learning [10]. Unlike many existing methods, our approach requires no training, is task agnostic, flexible, and completely data-driven. Nevertheless, it accurately described the task space structure on 15 binary classification tasks derived from CIFAR-10 using only a small amount of labeled data.

To summarize, our main contributions are as follows.

  • We present a simple method that establishes connections between fully trainable, finitely wide NNs and KMs, contrasting earlier efforts that always assume random networks or require infinite widths. (Sec. III)

  • We present a modular training framework, opening up new possibilities for modularized DL. Focusing on the two-module case, we provide an optimality proof for our learning approach, guaranteeing that it finds the loss function minimizer without between-module backpropagation. As empirical validation, we demonstrate its strong performance on MNIST and CIFAR-10. (Sec. IV)

  • Our modular learning approach sheds new light on the label requirement of DL: It relies almost exclusively on implicitly-labeled example pairs, achieving state-of-the-art accuracy on CIFAR-10 with a single randomly chosen fully-labeled example per class. (Sec. VII-D)

  • We propose a simple but effective method to quantify module reusability and task transferability using components from our modular training framework, demonstrating that modularized DL enables solutions to important issues in transfer learning, lifelong learning, etc. (Sec. V)

Refer to caption
Figure 1: Viewing the models from a new perspective, we identify the kernel machines “hidden” in neural networks. Specifically, by absorbing the trailing nonlinearity of a layer or a module (composition of layers) into the next layer, layers become linear models in feature spaces. These linear models can be shown to be kernel machines. Best viewed in color.

II Notations and Background

II-A Notations

Throughout, we use bold capital letters for matrices and tensors, bold lower-case letters for vectors, and unbold letters for scalars. (\mathbf{v})_{i} denotes the i^{\text{th}} component of vector \mathbf{v}. And \mathbf{W}^{(j)} denotes the j^{\text{th}} column of matrix \mathbf{W}. For a 3D tensor \mathbf{X}, \mathbf{X}[:,:,c] denotes the c^{\text{th}} matrix indexing along the third dimension from the left (or the c^{\text{th}} channel). We use \langle\cdot,\cdot\rangle_{H} to denote the inner product in an inner product space H. And the subscript shall be omitted if doing so causes no confusion. For functions, we use capital letters to denote vector/matrix/tensor-valued functions, and lower-case letters are reserved specifically for scalar-valued ones. In a network, we call a composition of an arbitrary number of layers a module for convenience.

Since we shall propose an alternative view on NNs that redefines the network layers and modules, it is helpful to introduce notations to distinguish our view from the conventional one. We use letter G_{i} (or g_{i}, depending on whether this layer or module is a scalar-valued function) with a numeric subscript i\in\mathbb{N}\setminus\{0\} to refer to the i^{\text{th}} network layer or module under the conventional view, and letter F_{i} (or f_{i}) the i^{\text{th}} layer or module (of the same model) under our view. Example usages can be found in Fig. 1 and Sec. III.

II-B Kernels and Kernel Machines

Kernel machines can be considered as linear models in feature spaces, the mappings into which are potentially nonlinear [11]. Given a feature map \Phi:\mathbb{R}^{d}\to H, where H is a (real) Hilbert space, i.e., an inner product space over the real numbers that is complete in the metric induced by the inner product, a kernel machine is a linear model in this feature space H:

f(\mathbf{x})=\langle\mathbf{w},\Phi(\mathbf{x})\rangle_{H}+b,\quad\mathbf{w}\in H,\ b\in\mathbb{R}, (1)

where \mathbf{w},b are its trainable weights and bias, respectively. For certain feature spaces H (called RKHSs), one can define a bivariate “kernel” function k that is symmetric and positive semidefinite via

k(\mathbf{u},\mathbf{v}):=\langle\Phi(\mathbf{u}),\Phi(\mathbf{v})\rangle_{H}. (2)

This definition is often used as an identity and is referred to as the “kernel trick” in some more modern texts [11]. Note that the correspondence between k and \Phi holds true in both directions and is sometimes stated the other way around: Per the Moore-Aronszajn theorem, for every symmetric, continuous, positive semidefinite bivariate function k that maps into \mathbb{R}, one can find an RKHS H such that k(\mathbf{u},\mathbf{v})=\langle\Phi(\mathbf{u}),\Phi(\mathbf{v})\rangle_{H}, where the feature map \Phi is defined through k as \Phi(\mathbf{u}):=k(\mathbf{u},\cdot) [12].

An RKHS, different from a general Hilbert space, possesses a “canonical coordinate system” induced by the kernel [13] (in addition to the regular coordinate system induced by its basis) that enables one to concisely express distance between feature vectors using kernel values, thanks to the reproducing property [14]. Concretely, we have

\|\Phi(\mathbf{u})-\Phi(\mathbf{v})\|_{H}^{2}=k(\mathbf{u},\mathbf{u})+k(\mathbf{v},\mathbf{v})-2k(\mathbf{u},\mathbf{v}). (3)
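The identities in Eq. 2 and Eq. 3 are easy to check numerically when the feature map is explicit. Below is a minimal sketch (not from the paper) using the degree-2 homogeneous polynomial feature map as an illustrative choice.

```python
import numpy as np

# Illustrative feature map R^2 -> R^3 of a homogeneous polynomial kernel of degree 2.
def phi(x):
    return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

def k(u, v):
    # The induced kernel: inner product of feature vectors (Eq. 2).
    return phi(u) @ phi(v)

u, v = np.array([1.0, 2.0]), np.array([0.5, -1.0])

# Kernel trick: k(u, v) equals (u . v)^2 for this feature map.
assert np.isclose(k(u, v), (u @ v) ** 2)

# Distance in feature space expressed purely through kernel values (Eq. 3).
lhs = np.sum((phi(u) - phi(v)) ** 2)
rhs = k(u, u) + k(v, v) - 2.0 * k(u, v)
assert np.isclose(lhs, rhs)
```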

One can use the “kernel trick” to implement highly nontrivial feature maps (thus highly complicated functions in the input space). For feature maps that are not implementable, e.g., when \Phi(\mathbf{x}) is an infinite series, one can approximate the KM with a set of “centers” \mathbf{x}_{1},\ldots,\mathbf{x}_{n} as follows:

f(\mathbf{x})\approx\langle\mathbf{w},\Phi(\mathbf{x})\rangle_{H}+b,\quad\mathbf{w}\in\operatorname{span}\{\Phi(\mathbf{x}_{1}),\ldots,\Phi(\mathbf{x}_{n})\},\ b\in\mathbb{R}. (4)

Assuming a k satisfying Eq. 2 can be found and using the “kernel trick”, the right-hand side is equal to

\sum_{i=1}^{n}\alpha_{i}k(\mathbf{x}_{i},\mathbf{x})+b,\quad\alpha_{i},b\in\mathbb{R}. (5)

Now, the learnable parameters become the \alpha_{i}’s and b.

This approximation changes the computational complexity of evaluating the kernel machine on a single example from \mathcal{O}(h), where h is the dimension of the feature space H and can be infinite for certain kernels, to \mathcal{O}(nd), where d is the dimension of the input space. In practice, the runtime over a sample is quadratic in sample size since n is typically on the same order as the sample size. There exist acceleration methods that reduce the complexity via further approximation in this case (e.g., [15]), yet the compromise in performance can be nonnegligible in practice, especially when the input space dimension is large.
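As a concrete illustration of Eq. 5 and its \mathcal{O}(nd) per-example cost, here is a minimal sketch; the Gaussian kernel and random centers are illustrative placeholders, not choices made in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 16                      # number of centers, input dimension
centers = rng.normal(size=(n, d))   # x_1, ..., x_n
alpha = rng.normal(size=n)          # trainable coefficients alpha_i
b = 0.0                             # trainable bias

def k(u, v):
    # Gaussian kernel: its feature map Phi is an infinite series.
    return np.exp(-np.sum((u - v) ** 2))

def f(x):
    # Evaluating the KM on one example costs O(nd): n kernel calls of O(d) each.
    return sum(a * k(c, x) for a, c in zip(alpha, centers)) + b

print(f(rng.normal(size=d)))
```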

III Revealing the Hidden Kernel Machines in Neural Networks

In this section, we present a method revealing the hidden KMs in NNs. The idea is simple: Instead of considering the nonlinearity to be the last component of a layer or a module (call it the i^{\text{th}} layer or module here for convenience), we consider it to be the first component of the next layer, i.e., layer i+1. After potentially repeating this process for layer i+1 to get rid of its trailing nonlinearity, we turn layer i+1 into a layer of linear models in feature spaces. These linear models can be shown to be KMs (Fig. 1). We first present the method for fully-connected networks and then extend it to CNNs [16]. Contrasting existing works (e.g., [7]), the extension requires only minimal adjustments.

III-A The Methodology

III-A1 Fully-Connected Neural Networks

We use a one-hidden-layer NN as an illustrative example. The idea scales easily to deeper models. Note that we assume the output layer has a single neuron. When the output layer has multiple neurons, one can apply the same analysis to each of these output neurons. Finally, we assume the output layer to be linear without loss of generality since if there is a trailing nonlinearity at the output, it can be viewed as a part of the loss function instead of a part of the network.

Considering an input vector \mathbf{x}\in\mathbb{R}^{d_{0}}, the model f=g_{2}\circ G_{1} is given as

G_{1}(\mathbf{x})=\Phi(\mathbf{W}_{1}^{\top}\mathbf{x})\in\mathbb{R}^{d_{1}}; (6)
f(\mathbf{x})=g_{2}(G_{1}(\mathbf{x}))=\mathbf{w}_{2}^{\top}G_{1}(\mathbf{x}), (7)

where \Phi is an elementwise nonlinearity such as ReLU [17] with \Phi(\mathbf{v}):=\left(\phi\left((\mathbf{v})_{1}\right),\ldots,\phi\left((\mathbf{v})_{d_{1}}\right)\right)^{\top} for any \mathbf{v}\in\mathbb{R}^{d_{1}} and some \phi:\mathbb{R}\to\mathbb{R}, G_{1} the input layer, g_{2} the output layer, f(\mathbf{x}) the final model output, and \mathbf{W}_{1} and \mathbf{w}_{2} the learnable weights of the NN model.

There are different ways of dividing this particular model into “layers”. Indeed, without changing the input-output mapping, we can redefine the layers as follows.

f(\mathbf{x})=\langle\mathbf{w}_{2},\Phi(\mathbf{W}_{1}^{\top}\mathbf{x})\rangle_{\mathbb{R}^{d_{1}}}=\langle\mathbf{w}_{2},\Phi(F_{1}(\mathbf{x}))\rangle_{\mathbb{R}^{d_{1}}}=f_{2}(F_{1}(\mathbf{x})), (8)

where F_{1}(\mathbf{x})=\left(\langle\mathbf{W}_{1}^{(1)},\mathbf{x}\rangle_{\mathbb{R}^{d_{0}}},\ldots,\langle\mathbf{W}_{1}^{(d_{1})},\mathbf{x}\rangle_{\mathbb{R}^{d_{0}}}\right)^{\top} and \langle\cdot,\cdot\rangle_{\mathbb{R}^{k}} is the canonical inner product of \mathbb{R}^{k} for k=d_{0},d_{1}, i.e., the dot product. In other words, we treat F_{1} as the new input layer and absorb the nonlinearity \Phi into the old output layer, forming the new output layer f_{2} such that each layer is now one or multiple parallel linear models (models linear in their weights, not necessarily in their inputs) (Fig. 1).

The one-hidden-layer NN can then be interpreted as a composition of KMs with finite-dimensional RKHSs. Using Eq. 2, the linear models on the input layer are simply KMs with feature maps being the identity map on \mathbb{R}^{d_{0}}, inducing the identity kernel, i.e., k_{I}(\mathbf{u},\mathbf{v}):=\langle\mathbf{u},\mathbf{v}\rangle_{\mathbb{R}^{d_{0}}}. On the other hand, the linear model on the output layer, i.e., f_{2}, can be easily identified as a KM with feature map being \Phi, (and together with the input layer) inducing kernel k\left(F_{1}(\mathbf{u}),F_{1}(\mathbf{v})\right):=\left\langle\Phi\left(F_{1}(\mathbf{u})\right),\Phi\left(F_{1}(\mathbf{v})\right)\right\rangle_{\mathbb{R}^{d_{1}}}.
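The regrouping in Eq. 8 can be checked directly in code. The following is a minimal sketch (with arbitrary illustrative dimensions) showing that the conventional view and the KM view compute the same output, and how the induced kernel is evaluated.

```python
import numpy as np

rng = np.random.default_rng(0)
d0, d1 = 8, 32
W1 = rng.normal(size=(d0, d1))     # input-layer weights
w2 = rng.normal(size=d1)           # output-layer weights
relu = lambda z: np.maximum(z, 0.0)

G1 = lambda x: relu(W1.T @ x)      # conventional input layer (nonlinearity last)
g2 = lambda h: w2 @ h              # conventional linear output layer

F1 = lambda x: W1.T @ x            # new input layer: d1 parallel linear models
f2 = lambda z: w2 @ relu(z)        # new output layer: a KM with feature map Phi = ReLU

x = rng.normal(size=d0)
# The conventional view g2(G1(x)) and the regrouped view f2(F1(x)) coincide.
assert np.isclose(g2(G1(x)), f2(F1(x)))

# The kernel induced by the output layer's feature map (Sec. III-A1).
k = lambda u, v: relu(F1(u)) @ relu(F1(v))
```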

III-A2 Convolutional Neural Networks

Refer to caption
Figure 2: A convolutional layer (illustrated is a single-filter instantiation) is multiple kernel learning: It can be equated to concatenated KMs using distinct kernels but sharing weights. Each color corresponds to a KM. Elements in black are shared across KMs. Best viewed in color.

The above method for fully-connected networks can be extended to CNNs as follows. We discuss here simple 2D convolutional networks and the same idea generalizes to more complicated models easily.

Similar to the fully-connected case, we first absorb the trailing elementwise nonlinearity \Phi of each layer or module of interest into the next layer. Suppose the activation tensor of a given layer or module is \mathbf{X}\in\mathbb{R}^{H\times W\times C}; then each channel of the next layer is a matrix of KMs sharing the same weights, with the ij^{\text{th}} KM having feature map \Phi_{ij}(\mathbf{X}):=r_{ij}\circ\Phi(\mathbf{X}), where r_{ij}:\mathbb{R}^{H\times W\times C}\to\mathbb{R}^{h\times w\times C}:\mathbf{Z}\mapsto[s_{ij}(\mathbf{Z}[:,:,1]),\ldots,s_{ij}(\mathbf{Z}[:,:,C])], s_{ij} denotes the operator that, upon receiving a matrix, returns an h\times w vectorized receptive field centered at a specific location (depending on ij), and [\cdot,\ldots,\cdot] denotes vector concatenation. This is illustrated in Fig. 2.
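In PyTorch, this equivalence between a convolutional layer and weight-sharing KMs over receptive fields can be sketched with torch.nn.functional.unfold, which plays the role of the r_{ij} operators (all shapes below are illustrative):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(1, 3, 8, 8)                 # pre-nonlinearity activations (C=3)
weight = torch.randn(5, 3, 3, 3)            # 5 filters of size 3x3

Z = torch.relu(X)                           # absorb Phi into this layer
out_conv = F.conv2d(Z, weight, padding=1)   # the usual convolution

# The same computation as KMs: unfold extracts the vectorized receptive
# fields Phi_ij(X); every output location is an inner product between the
# shared filter weights and one such feature vector.
patches = F.unfold(Z, kernel_size=3, padding=1)   # (1, 3*3*3, 64)
out_km = weight.view(5, -1) @ patches             # shared weights, one KM per ij
assert torch.allclose(out_conv.view(1, 5, -1), out_km, atol=1e-5)
```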

There are two main differences compared to the fully-connected case. First, in the convolutional case, the KMs on each channel in a given layer share the same weights, whereas in the fully-connected case, there is no such restriction. Second, these KMs have distinct kernel functions, unlike in the fully-connected case where they share kernels. These observations allow for a new perspective on interpreting CNNs. Indeed, the fact that each convolutional layer is essentially a set of KMs using distinct kernels but sharing weights suggests that it can be viewed as an instantiation of the multiple kernel learning framework that has been studied in the kernel method literature for decades [18]. However, here the composition is different because it is an embedding of functions.

III-B Implementable Feature Maps and Linear Runtime

Note that, for common NNs, the kernels involved in the above methods have implementable feature maps with feature space dimensions equal to the widths of the corresponding NN layers (so they are always finite). Therefore, one no longer needs to rely on the kernel trick and approximate the kernel machines through pairwise evaluations on a set of centers — one can simply use the explicit linear model representation f(\mathbf{x})=\langle\mathbf{w},\Phi(\mathbf{x})\rangle+b! The major advantage is that the computational complexity of running the model on a single data example becomes \mathcal{O}(p), where p is the size of the corresponding NN layer (or equivalently, the size of \Phi(\mathbf{x})). This reduces the runtime over a set of data from the usual super-quadratic to linear in sample size, making the model much more practical.

IV Provably Optimal Modular Learning

Constructing a unified model class that encompasses both NNs and KMs yields immediately practical implications. Namely, existing algorithms and results for KMs can now be directly applied to NNs and vice versa. Specifically, one may expect the theory behind KMs (see, e.g., [19, 14, 20]) to significantly enrich our understanding of NNs and enable novel algorithms.

In this section, we propose a provably optimal modular training framework for NNs in classification, which enables fully modularized DL workflows. We focus on the two-module case. The idea can be easily generalized to enable modular training with more than two modules by analyzing one pair of modules at a time. We leave the corresponding optimality proofs as future work. Applying this framework to NNs essentially relies on viewing the output module as a KM and utilizing its linearity in an RKHS and properties of the RKHS. This demonstrates how the interplay of NN and KM theories can produce useful results.

IV-A Set-Up, Goal, and an Idea

Suppose we have a deep model consisting of two modules F=F_{2}\circ F_{1} and an objective function L(F,S), where S is a training set S=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}. F_{1},F_{2} can be compositions of arbitrary layers.

A modular learning algorithm trains F_{1}, freezes it afterwards, then trains F_{2}. And the goal is that after wiring together the two trained modules, the overall model minimizes L.

For a given S, define

\mathcal{F}_{1}^{\star}:=\{F_{1}:\exists F_{2}\text{ s.t. }F_{2}\circ F_{1}\in\operatorname*{arg\,min}_{F}L(F,S)\}. (9)

Clearly, to obtain an F that minimizes L, the goal when training F_{1} can be to find an F_{1}^{\prime} that is in \mathcal{F}_{1}^{\star}. Then one can simply train F_{2} to minimize L(F_{2}\circ F_{1}^{\prime},S) and the resulting minimizer F_{2}^{\prime} will satisfy F_{2}^{\prime}\circ F_{1}^{\prime}\in\operatorname*{arg\,min}_{F}L(F,S). And if we can characterize \mathcal{F}_{1}^{\star} independently of the trainable parameters of F_{2}, the training of F_{1} can be decoupled from that of F_{2}.

IV-B Main Result

Earlier we have shown how to redefine NN layers such that they become KMs, but we have not yet demonstrated any practical advantage of such a representation. In this section, we show that under this KM view on the output module, the optimal input module can be described independently of the output module, which, as stated above, is the core idea motivating our modular learning algorithm. For simplicity, we discuss only binary classification. The result easily extends to classification with more classes.

Concretely, we present a result stating that if F_{2} is a single KM with kernel k, \mathcal{F}_{1}^{\star} can be characterized independently of the trainable parameters of F_{2} via pairwise evaluations of the kernel k on training data from distinct classes. The proof of the following theorem is given in the Appendix.

Theorem IV.1.

Let S=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n},\ \mathbf{x}_{i}\in\mathbb{R}^{d_{0}},\ y_{i}\in\{+,-\},\ \forall i, be given and consider F_{1}:\mathbb{R}^{d_{0}}\to\mathbb{R}^{d_{1}} and f_{2}:\mathbb{R}^{d_{1}}\to\mathbb{R}:\mathbf{z}\mapsto\langle\mathbf{w},\phi(\mathbf{z})\rangle+b, where \mathbf{w},b are free parameters and \phi is a given mapping into a real inner product space with \|\phi(\mathbf{u})\|=\alpha for all \mathbf{u}\in\mathbb{R}^{d_{1}} and some fixed \alpha>0 (throughout, we consider the natural norm induced by the inner product, i.e., \|\mathbf{t}\|^{2}:=\langle\mathbf{t},\mathbf{t}\rangle for all \mathbf{t}). Let I_{+} be the set of i’s in \{1,\ldots,n\} such that y_{i}=+. And let I_{-} be the set of j’s in \{1,\ldots,n\} such that y_{j}=-. Suppose the objective function L admits the following form:

L(f_{2}\circ F_{1},S)=\frac{1}{n}\sum_{i\in I_{+}}\ell_{+}\left(f_{2}\circ F_{1}(\mathbf{x}_{i})\right)+\frac{1}{n}\sum_{j\in I_{-}}\ell_{-}\left(f_{2}\circ F_{1}(\mathbf{x}_{j})\right)+\lambda g(\|\mathbf{w}\|), (10)

where \lambda\geq 0 and g,\ell_{+},\ell_{-} are all real-valued with g,\ell_{-} nondecreasing and \ell_{+} nonincreasing.

For an F_{1}^{\star}, let f_{2}^{\star} be in \operatorname*{arg\,min}_{f_{2}}L(f_{2}\circ F_{1}^{\star},S). If, \forall i\in I_{+},j\in I_{-}, F_{1}^{\star} satisfies

\|\phi(F_{1}^{\star}(\mathbf{x}_{i}))-\phi(F_{1}^{\star}(\mathbf{x}_{j}))\|\geq\|\phi(\mathbf{s})-\phi(\mathbf{t})\|,\quad\forall\mathbf{s},\mathbf{t}\in\mathbb{R}^{d_{1}}, (11)

then

f_{2}^{\star}\circ F_{1}^{\star}\in\operatorname*{arg\,min}_{f_{2}\circ F_{1}}L(f_{2}\circ F_{1},S). (12)

Remark: Defining the kernel

k\left(F_{1}(\mathbf{u}),F_{1}(\mathbf{v})\right)=\left\langle\phi\left(F_{1}(\mathbf{u})\right),\phi\left(F_{1}(\mathbf{v})\right)\right\rangle, (13)

Eq. 11 is equivalent to

k\left(F_{1}^{\star}(\mathbf{x}_{i}),F_{1}^{\star}(\mathbf{x}_{j})\right)\leq k\left(\mathbf{s},\mathbf{t}\right),\quad\forall\mathbf{s},\mathbf{t}\in\mathbb{R}^{d_{1}},\ \forall i\in I_{+},j\in I_{-}. (14)

Further, if the infimum of k(\mathbf{u},\mathbf{v}) is attained in \mathbb{R}^{d_{1}}\times\mathbb{R}^{d_{1}} and equals \beta, then Eq. 11 is equivalent to

k\left(F_{1}^{\star}(\mathbf{x}_{i}),F_{1}^{\star}(\mathbf{x}_{j})\right)=\beta,\quad\forall i\in I_{+},j\in I_{-}. (15)

Interpreting a two-module classifier F_{2}\circ F_{1} as F_{1} learning a new representation of the given data on which F_{2} will carry out the classification, the intuition behind our result can be explained as follows. Given some data, its “optimal” representation for a linear model to classify should be the one where examples from distinct classes are located as far from each other as possible. Thus, the optimal F_{1} should be the one that produces this optimal representation. And since our F_{2} is a linear model in an RKHS feature space and because distance in an RKHS can be expressed via evaluations of its reproducing kernel, the optimal F_{1} can then be fully described with only pairwise kernel evaluations over the training data, i.e., k\left(F_{1}(\mathbf{x}_{i}),F_{1}(\mathbf{x}_{j})\right) for \mathbf{x}_{i},\mathbf{x}_{j} being training examples from different classes. In other words, an RKHS provides a useful “canonical coordinate system” [13] that enables a concise definition of the optimal input module for any linear classifier in that RKHS using only kernel evaluations.

IV-C Applicability of the Main Result

Theorem IV.1 imposes assumptions on the network architecture and the classification objective function. We now show that these assumptions are satisfied in many standard classification set-ups.

IV-C1 A Two-Module View on Neural Networks

Most popular networks, including the ResNet family [9], admit the following representation:

F=G_{2}\circ G_{1}, (16)

where G_{2} is an output linear layer: \left(G_{2}(\mathbf{x})\right)_{i}=\langle\mathbf{w}_{i},\mathbf{x}\rangle+b_{i}, and G_{1} is a composition of arbitrary previous layers ending with a nonlinearity \Phi. When the model also uses a nonlinearity on top of the output linear layer, one may absorb that nonlinearity into the loss function such that the model still assumes the aforementioned representation.

Considering the ending nonlinearity of G_{1} as the beginning component of the output layer, we can view the output layer as the KM output module and apply Theorem IV.1. To be specific, writing G_{1}=\Phi\circ F_{1} for some F_{1}, we can rewrite the model such that it satisfies the condition of Theorem IV.1:

\left(F(\mathbf{x})\right)_{i}=\langle\mathbf{w}_{i},G_{1}(\mathbf{x})\rangle+b_{i}=\langle\mathbf{w}_{i},\Phi\left(F_{1}(\mathbf{x})\right)\rangle+b_{i}, (17)
k(F_{1}(\mathbf{u}),F_{1}(\mathbf{v})):=\langle\Phi(F_{1}(\mathbf{u})),\Phi(F_{1}(\mathbf{v}))\rangle. (18)

Note that the assumption that \|\Phi(\mathbf{u})\| is fixed for all \mathbf{u} may require that one normalizes the activation vector/matrix/tensor in practice, which is the only modification one has to make for a standard NN to satisfy the conditions of the theorem.

IV-C2 Loss Functions

In terms of loss functions, many common ones, including softmax + cross-entropy (two-class version), any monotonic nonlinearity + mean squared error, and hinge loss, admit the required representation. We provide details in the Appendix.

IV-D From Theory to Algorithm

To convert the theoretical result into an implementable learning algorithm for classification (with potentially more than two classes), one may proceed as follows. Let an NN G_{2}\circ G_{1} be given, where G_{2} is a linear layer (absorb the trailing nonlinearity into the loss function if necessary) and G_{1}=\Phi\circ F_{1} is a composition of arbitrary layers followed by nonlinearity \Phi. Suppose the activation vector of G_{1} is normalized to a unit vector by (elementwise) dividing the activation vector by its norm. Also let a loss function L be given and suppose it (or its two-class analog) satisfies the requirements of Theorem IV.1. Define kernel k(F_{1}(\mathbf{u}),F_{1}(\mathbf{v}))=\langle\Phi(F_{1}(\mathbf{u})),\Phi(F_{1}(\mathbf{v}))\rangle. Determine \beta:=\min k based on \Phi. Some examples include: \beta=0 for ReLU and sigmoid; \beta=-1 for tanh. Denote G_{2}\circ\Phi as F_{2}. The training consists of two stages: Training F_{1} and training F_{2} (without touching F_{1}).
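A minimal PyTorch sketch of this set-up follows; the class and method names are illustrative (they are not taken from the released kerNET code), and tanh with unit normalization is assumed as the linking nonlinearity.

```python
import torch
import torch.nn.functional as F

class TwoModuleNet(torch.nn.Module):
    def __init__(self, backbone, out_features, n_classes):
        super().__init__()
        self.F1 = backbone                                  # input module
        self.G2 = torch.nn.Linear(out_features, n_classes)  # output linear layer

    def phi(self, x):
        # Trailing nonlinearity of the input module, normalized to a unit
        # vector so that ||Phi(u)|| = alpha = 1 for all u (Theorem IV.1).
        return F.normalize(torch.tanh(self.F1(x)), dim=-1)

    def kernel(self, x1, x2):
        # k(F1(u), F1(v)) = <Phi(F1(u)), Phi(F1(v))> (Eq. 18); beta = -1 for tanh.
        return self.phi(x1) @ self.phi(x2).t()

    def forward(self, x):                                   # the full model F2 o F1
        return self.G2(self.phi(x))
```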

Training F_{1}. Given a batch of training data \{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}, let \mathcal{N} (for negative) denote all pairs of indices i,j such that y_{i}\neq y_{j}, and let \mathcal{P} (for positive) denote all pairs of indices i\neq j with y_{i}=y_{j}. Train F_{1} to maximize one of the following proxy objective functions.

  • Alignment (negative only) (AL-NEO):

    L_{1}(F_{1})=\frac{\beta\sum_{(i,j)\in\mathcal{N}}k\left(F_{1}(\mathbf{x}_{i}),F_{1}(\mathbf{x}_{j})\right)}{|\beta||\mathcal{N}|\sqrt{\sum_{(i,j)\in\mathcal{N}}\left(k\left(F_{1}(\mathbf{x}_{i}),F_{1}(\mathbf{x}_{j})\right)\right)^{2}}}; (19)
  • Contrastive (negative only) (CTS-NEO):

    L_{1}(F_{1})=-\frac{1}{|\mathcal{N}|}\sum_{(i,j)\in\mathcal{N}}\exp\left(k\left(F_{1}(\mathbf{x}_{i}),F_{1}(\mathbf{x}_{j})\right)\right); (20)
  • Negative Mean Squared Error (negative only) (NMSE-NEO):

    L_{1}(F_{1})=-\frac{1}{|\mathcal{N}|}\sum_{(i,j)\in\mathcal{N}}\left(k\left(F_{1}(\mathbf{x}_{i}),F_{1}(\mathbf{x}_{j})\right)-\beta\right)^{2}. (21)

All of the above proxy objectives can be shown to learn an F_{1} that satisfies the optimality condition required by Theorem IV.1. A code sketch of these proxies is given below.
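Below is a minimal PyTorch sketch of the three “negative only” proxies (Eqs. 19–21); it assumes K_neg is a 1-D tensor holding the kernel values k(F_{1}(\mathbf{x}_{i}),F_{1}(\mathbf{x}_{j})) over all pairs (i,j) in \mathcal{N}, e.g., gathered from the kernel matrix produced by the TwoModuleNet sketch above.

```python
import torch

def al_neo(K_neg, beta):
    # Eq. 19: cosine-style alignment between the negative-pair kernel
    # values and a constant-beta target.
    return beta * K_neg.sum() / (abs(beta) * len(K_neg) * K_neg.norm())

def cts_neo(K_neg):
    # Eq. 20.
    return -torch.exp(K_neg).mean()

def nmse_neo(K_neg, beta):
    # Eq. 21.
    return -((K_neg - beta) ** 2).mean()

# Each proxy is to be maximized, e.g., loss = -nmse_neo(K_neg, beta=-1.0).
```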

Note that in cases where some of these proxies are undefined (for example, when \beta=0), we may train F_{1} to maximize the following alternative proxy objectives instead, assuming \alpha:=\sup k is known, where we define k^{\star}_{ij} to be \alpha if (i,j)\in\mathcal{P} or if i=j, and \beta otherwise. Compared to the previous proxies, these are applicable to all k’s and impose a stronger constraint on learning. Specifically, in addition to controlling the inter-class pairs, these proxies force intra-class pairs to share representations.

  • Alignment (AL): The kernel alignment [21] between the kernel matrices formed by k and k^{\star} on the given data.

  • Upper Triangle Alignment (UTAL): Same as AL, except only the upper triangles minus the main diagonals of the two matrices are considered. This can be considered a refined version of the raw alignment.

  • Contrastive (CTS):

    L_{1}(F_{1})=\frac{\sum_{(i,j)\in\mathcal{P}}\exp\left(k\left(F_{1}(\mathbf{x}_{i}),F_{1}(\mathbf{x}_{j})\right)\right)}{\sum_{(i,j)\in\mathcal{N}\cup\mathcal{P}}\exp\left(k\left(F_{1}(\mathbf{x}_{i}),F_{1}(\mathbf{x}_{j})\right)\right)}; (22)
  • Negative Mean Squared Error (NMSE):

    L_{1}(F_{1})=-\frac{1}{n^{2}}\sum_{(i,j)}\left(k\left(F_{1}(\mathbf{x}_{i}),F_{1}(\mathbf{x}_{j})\right)-k^{\star}_{ij}\right)^{2}. (23)

Empirically, we found that the alignment and squared-error proxies typically produced better results than the contrastive ones, potentially because the exponential terms in the contrastive objectives narrow the range of workable learning rates.
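For completeness, here is a matching sketch of the target-matrix-based proxies; target_matrix builds k^{\star} from the batch labels, and alignment implements the kernel alignment [21] as a cosine similarity between matrices (names are illustrative).

```python
import torch

def target_matrix(y, alpha, beta):
    # k*_ij = alpha on same-class pairs (and the diagonal), beta otherwise.
    same = (y[:, None] == y[None, :]).float()
    return alpha * same + beta * (1.0 - same)

def alignment(K, K_star):
    # Kernel alignment [21]: cosine similarity between the two kernel matrices.
    return (K * K_star).sum() / (K.norm() * K_star.norm())

def nmse(K, K_star):
    # Eq. 23.
    return -((K - K_star) ** 2).mean()
```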

Training F_{2}. Now suppose F_{1} has been trained and frozen at F_{1}^{\prime}; we simply train F_{2} to minimize the overall loss function L(F_{2}\circ F_{1}^{\prime}). This process is illustrated in Fig. 3.
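Putting the two stages together, a high-level sketch of the full modular training loop might look as follows, assuming the TwoModuleNet and proxy definitions above (all hyperparameters and names are illustrative):

```python
import torch

def train_modular(model, loader, proxy, loss_fn, epochs, beta=-1.0):
    # Stage 1: train the input module alone to maximize the proxy objective.
    opt1 = torch.optim.SGD(model.F1.parameters(), lr=0.1, momentum=0.9)
    for _ in range(epochs):
        for x, y in loader:
            K = model.kernel(x, x)
            neg = y[:, None] != y[None, :]      # pairs from distinct classes
            loss = -proxy(K[neg], beta)         # e.g., proxy = nmse_neo
            opt1.zero_grad(); loss.backward(); opt1.step()

    # Stage 2: freeze F1, then train the output module on the overall loss.
    for p in model.F1.parameters():
        p.requires_grad_(False)
    opt2 = torch.optim.SGD(model.G2.parameters(), lr=0.1, momentum=0.9)
    for _ in range(epochs):
        for x, y in loader:
            loss = loss_fn(model(x), y)         # e.g., cross-entropy
            opt2.zero_grad(); loss.backward(); opt2.step()
```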

Refer to caption
Figure 3: The proposed modular training algorithm (two-module case) consists of two training stages. Suppose we are given a two-module model F_{2}\circ F_{1} and an overall classification objective function L(F_{2}\circ F_{1}), e.g., cross-entropy with weight regularization. First, the input module F_{1} is trained to maximize a proxy objective as defined in Sec. IV-D. Then this input module is frozen at, say, F_{1}^{\prime}. And the output module is trained to minimize L(F_{2}\circ F_{1}^{\prime}). Note that the training of the input module does not involve training the output module, and vice versa. In other words, the training process is fully modular.

V A Method for Task Transferability Estimation

In this section, we demonstrate the improved module reusability of modularized DL. Let a source task be defined as one that we have already solved and a target task be one that we are interested in solving. Specifically, we consider the following practical issue in transfer learning: Assume one is given a set of input modules pre-trained on some source tasks and a target task to be solved by training an output module on top of a frozen input module. Which input module will result in the most performant network on the target task?

Underlying this practical problem is a theoretical issue that is central to many important research domains including transfer learning, continual/lifelong learning, meta learning, and multi-task learning: Given a set of tasks, how to effectively model the task space structure in terms of the transferability between task pairs, i.e., how helpful knowledge about a source task is for solving a target task [22, 23, 10, 24].

We propose the following solution to the module reusability (and also task transferability) problem. Let a target task be characterized by the loss function L(F_{2}\circ F_{1},S), where L is a loss function assuming the representation in Theorem IV.1 and S is some training data [10]. Our modular learning framework suggests that we can measure the goodness of these pre-trained F_{1}’s for this target task by measuring L_{1}(F_{1},S), where L_{1} is a proxy objective. This is based on the fact that the F_{1} that maximizes the proxy objective L_{1}(F_{1},S) constitutes the input module of a minimizer of L(F_{2}\circ F_{1},S). Therefore, one can simply rank the pre-trained F_{1}’s in terms of how well they maximize L_{1}(F_{1},S) and select the maximizer. Since a well-trained input module must encode useful information for the source task, this procedure can also fully describe the task space structure of the given tasks. The benefit is that this procedure requires no training. Indeed, one only needs to run the given modules on potentially a subset of S. Moreover, this method is task agnostic, flexible, and completely data-driven.
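In code, the whole procedure reduces to a forward pass plus a proxy evaluation per candidate module; a minimal sketch (with illustrative names, reusing the nmse_neo proxy sketched earlier) follows.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def transferability(module, x, y, proxy, beta=-1.0):
    # Run the frozen, pre-trained input module on target-task data and
    # score it with the proxy objective; no training is involved.
    phi = F.normalize(torch.tanh(module(x)), dim=-1)
    K = phi @ phi.t()
    neg = y[:, None] != y[None, :]          # inter-class pairs
    return proxy(K[neg], beta).item()

# Rank candidates and pick the best, e.g.:
# best = max(pretrained_modules, key=lambda m: transferability(m, x, y, nmse_neo))
```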

In terms of the optimality of our transferability measure, if we define the true transferability of a particular F_{1} to be \min_{F_{2}}\mathbb{E}_{(\mathbf{x},y)}L(F_{2}\circ F_{1},(\mathbf{x},y)) [10], then \min_{F_{2}}L(F_{2}\circ F_{1},S) is a bound on the true transferability minus a complexity measure on the model class from which we choose our model [25]. We leave a more refined study as future work.

Comparisons of our method against some notable related work are provided in Sec. VI-C and summarized in Table I, from which it is clear that our method is among the fastest and most flexible.

VI Related Work

VI-A Connecting Neural Networks With Kernel Methods

While some works establish connections via exactly matching one architecture to the other, others do so from a probabilistic perspective by studying large-sample behavior in the infinite layer widths limit and/or taking the expectation over the network parameters. The former line of research, to which this work belongs, often yields more direct and practical results since the theories operate under much milder assumptions.

VI-A1 Exact Equivalences

In [19], a family of kernels was defined to mimic single-hidden-layer MLPs. The resulting KMs bear the same mathematical formulations as the corresponding NNs with the constraint that the input layer weights of the NNs are fixed. The authors of [26] modified these kernels to allow the KMs to correspond to fully-trainable single-hidden-layer MLPs. Their construction can be viewed as a special case of ours. Nevertheless, they presented their work as another way to train MLPs without pointing out the connections with KMs, i.e., that their MLPs are in fact also KMs. The input and output layers are trained alternately, with the former learning to minimize the VC dimension [27] of the latter while the latter learns to classify. An optimality guarantee of the training was hinted at. In contrast, we extended their theoretical framework, explicitly established the connections between NNs and KMs, and proposed a fully modular training approach with a proven, strong optimality guarantee.

VI-A2 Equivalences in Infinite Widths and/or in Expectation

That single-hidden-layer MLPs are Gaussian processes in the infinite width limit and in expectation over a random input layer has been known at least since [1]. [2] generalized the classic result to deeper MLPs. [5] defined a family of “arc-cosine” kernels to imitate the computations performed by infinitely wide networks in expectation. [4] proposed kernels that are equivalent to expectations of finite-width random networks. [7] presented exact computations of some kernels, using which the kernel regression models can be shown to be the limit (in widths and training time) of fully-trainable, infinitely wide fully-connected networks trained with gradient descent. In comparison, our work established that NNs can be directly viewed as KMs, requiring neither infinite widths nor taking expectations over random parameters.

VI-B Modularized Deep Learning

NNs were developed with inspiration from human brain structures and functions [28]. The human brain itself is modular in a hierarchical manner: Learning always occurs in a very localized subset of highly inter-connected nodes that are relatively sparsely connected to nodes in other modules [29, 30]. That is, the human brain is organized as functional, sparsely connected subunits [31].

Many existing works in machine learning can be analyzed from the perspective of modularization. An old example is the mixture of experts [32, 33], which uses a gating function to enforce that each expert in a committee of networks solves a distinct group of training cases. A more recent example is generative adversarial networks (GANs) [34]. Typical GANs have two competing neural networks that are essentially decoupled in functionality and can be viewed as two modules. In [35], the authors proposed an a posteriori method that analyzes a trained network as modules in order to extract useful information. Most works in this direction, however, achieved only partial modularization due to their dependence on end-to-end optimization.

The authors of [36] pioneered the idea of greedily learning the architecture of an NN, achieving full modularization. In their work, each new node is added to maximize the correlation between its output and the residual error. Several works attempted to remove the need for end-to-end backpropagation via approximating gradient signals locally at each layer or each node [37, 38, 39, 40, 41]. In [42], a backpropagation-free deep architecture based on decision trees was proposed. In [43], the authors proposed to learn the hidden layers with unsupervised contrastive learning, decoupling their training from that of the output layer. Compared to these existing methods that enable full modularization, our method is simple to implement and provides a strong optimality guarantee. A similar provably optimal modular training framework was proposed in [44]. However, their optimality result was less general than ours, and their framework required that the NN be modified by substituting certain neurons with KMs using classical kernels that have quadratic runtime.

VI-C Task Transferability

Describing the task space structure via measuring the transferability between tasks, i.e., estimating to what extent representations learned from one task can help in learning another, is a central problem in transfer learning, multi-task learning, meta learning, continual/lifelong learning, etc. [22, 23, 10, 24].

Information-theoretic measures have proven useful in quantifying task transferability. Early task-relatedness measures include the \mathcal{F}-relatedness [45] and the \mathcal{A}-distance [46]. \mathcal{H}-divergence [47] and Wasserstein distance [48] have received increasing attention in recent years. Specifically, [49] applied \mathcal{H}-divergence in natural language processing for text classification, whereas [50] used Wasserstein distance to estimate the similarity of linear parameters instead of the data generation distributions. Along this line of research, some more recent methods include the H-score [51] and the Bregman-correntropy conditional divergence [52], the latter of which used the correntropy functional [53] and the Bregman matrix divergence [54] to quantify divergence between mappings.

Model-based methods, i.e., methods that leverage trained models to extract knowledge about tasks, are more closely related to our proposed method. [10] estimated task transferability via the negative conditional entropy between the training label sequences of the two tasks, assuming the tasks share the same input data examples. The proposed method assumed the existence of an optimal trained source model but did not actually use a trained model in the estimation. [24] removed the input-sharing assumption. Both works assumed that the source and the target tasks use the cross-entropy loss. Our method, on the other hand, makes only mild assumptions on the choice of loss function. Taskonomy [22] estimated the transferability of a pair of tasks via the transfer performance from the tasks to a third reference task. Task2Vec [23] tuned a single “probe” network on all target tasks and extracted task-characterizing information via the parameters of this network. Both Taskonomy and Task2Vec required training models on target tasks to be able to extract task information. In contrast, our method does not require any training on target tasks.

All of the mentioned model-based methods have their limiting assumptions. The main assumption of our method, required by our optimality guarantee, is that the output module (classifier) that will be tuned on top of the transferred component admits a particular form as described in Theorem IV.1. Essentially, it is required to be a single-layer NN. It is worth noting that this set-up is common in practice. We summarize the properties of these related methods in Table I.

Method          Training      Test Runtime          Main Assumption
TASK2VEC [23]   required      —                     a reference model
Taskonomy [22]  required      —                     a third reference task
NCE [10]        not required  \mathcal{O}(n^{3})    tasks share input
LEEP [24]       not required  \mathcal{O}(n^{2})    cross-entropy loss
Ours            not required  \mathcal{O}(n^{2})    form of classifier
TABLE I: Comparisons with similar methods for task transferability estimation. For the training-free methods, we also provide the test runtime in terms of n, the size of the target task dataset used for the estimation. Note that the extra training constitutes the main overhead of the methods that require it, making comparisons of their test runtime against the training-free methods meaningless. Details are provided in Sec. VI-C.

VII Experiments

VII-A Sanity Check: Modular Training Results in Identical Learning Dynamics As End-to-End

In this section, we attempt to answer the important question: Do end-to-end and the proposed modular training, when restricted to the architecturally identical hidden modules, drive the underlying modules to functions that are the same in terms of minimizing the overall loss function? Clearly, a positive answer would verify empirically the optimality of the modular approach.

We now test this hypothesis with toy data. The setup is as follows. We generate 1000 32-dimensional random inputs and assign them random labels from 10 classes. The underlying network consists of two modules. The input module is a (32\to 512) fully-connected (fc) layer followed by a ReLU nonlinearity and another (512\to 2) fc layer. The output module is a (2\to 10) fc layer. The two modules are linked by a tanh nonlinearity (output normalized to a unit vector). The overall loss is the cross-entropy loss. And for modular training, the proxy objective is CTS-NEO. We visualize the activations from the tanh nonlinearity as an indicator of the behavior of the hidden layers under training.
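For concreteness, the toy two-module network just described could be sketched as follows (a minimal reconstruction, not the exact experiment code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyInputModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(32, 512)
        self.fc2 = nn.Linear(512, 2)

    def forward(self, x):
        z = self.fc2(F.relu(self.fc1(x)))
        # Linking nonlinearity: tanh, with the output normalized to a unit
        # vector; these 2-D activations are what Fig. 4 visualizes.
        return F.normalize(torch.tanh(z), dim=-1)

output_module = nn.Linear(2, 10)   # trained on cross-entropy in stage 2
```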

From Fig. 4, we see that the underlying input modules are indeed driven to the same functions (when restricted to the training data and in terms of minimizing the overall loss) by both modular and end-to-end training in the limit of training time going to infinity. This confirms the optimality of our proposed method (albeit in a simplified setup) and allows us to modularize learning with confidence.

Refer to caption (figure with panels (a)–(f))
Figure 4: Using toy data, we visualize the learning dynamics of our modular training approach and that of backpropagation by visualizing the output representations from the input module. We observe that the two methods result in input modules that are identical functions when restricted to the random training data, confirming that one may greedily optimize the input module using a proxy objective and still obtain the same outcome as an end-to-end approach. e2e for end-to-end training. mdlr for modular training. Classes are color-coded.

VII-B Sanity Check: Proxy Objectives Align Well With Accuracy

We now extend the verification of optimality from the simplified setup in the previous section to a more practical setting. Recall that we have characterized our proxy objective as a function of the input module whose maximizers constitute parts of the overall loss minimizers, and we have proposed to train the input module to maximize the proxy objective.

Ideally, however, a proxy objective for the input module should satisfy the following: The overall accuracy is a function of solely the proxy value and the output module, and, as a function of the proxy value, the overall accuracy is strictly increasing (this type of result is reminiscent of the “ideal bounds” for representation learning sought in, e.g., [55]). We now demonstrate empirically that the proxies we proposed enjoy this property.

The setup is as follows. We train a LeNet-5 as two modules on MNIST with (conv1\to tanh\to max pool\to conv2\to tanh\to max pool\to fc1\to tanh\to fc2) as the input module, and (tanh\to fc3) as the output one. The input module is trained for a number of different epochs so as to achieve different proxy values. And for each epoch, we freeze the input module and train the output module to minimize the overall cross-entropy until it converges. Mathematically, suppose we froze the input module at F_{1}^{\prime} and denote the proxy as L_{1} and the overall accuracy as A; we are visualizing \max_{F_{2}}A(F_{2}\circ F_{1}^{\prime}) vs. L_{1}(F_{1}^{\prime}), where F_{2} is the output module.

Fig. 5 shows that the overall accuracy, as a function of the proxy value, is indeed approximately increasing. Further, this positive correlation becomes near perfect in the high-accuracy regime, which agrees with our theoretical results. To summarize, we have extended our theoretical guarantees and empirically verified that maximizing the proposed proxy objectives effectively learns input modules that are optimal in terms of maximizing the overall accuracy, rendering end-to-end training unnecessary.

Refer to caption
Figure 5: The overall accuracy, as a function of the proxy objective, is increasing. The positive correlation becomes near perfect in the high-performance regime, validating our theoretical results. Overall, this justifies the optimality of training the input module to maximize the proxy objective. Note that the values of the illustrated proxies were normalized to [0,1] in order for them to be properly visualized in the same plot.

VII-C Accuracy on MNIST and CIFAR-10

We now present the main results on MNIST and CIFAR-10 demonstrating the effectiveness of our modular learning method. To facilitate fair comparisons, end-to-end and modular training operate on the same backbone network. For all results, we used stochastic gradient descent as the optimizer with batch size 128. For each module in the modular method as well as the end-to-end baseline, we trained with annealing learning rates (0.1, 0.01, 0.001, each for 200 epochs). The momentum was set to 0.9 throughout. For data preprocessing, we used simple mean subtraction followed by division by standard deviation. On CIFAR-10, we used the standard data augmentation pipeline from [9], i.e., random flipping and cropping with the exact parameters specified therein. In modular training, the models were trained as two modules, with the output layer alone as the output module and everything else as the input module, as specified in Sec. IV-C1. We normalized the activation vector right before the output layer of each model to a unit vector such that the equal-norm condition required by Theorem IV.1 is satisfied. We did not observe a significant performance difference after this normalization.

From Tables II and III, we see that on three different backbone networks and on both MNIST and CIFAR-10, our modular learning compares favorably against end-to-end training. Since under two-module modular training the ResNets can be viewed as KMs with adaptive kernels, we also compared performance with other NN-inspired kernel methods in the literature. In Table III, the ResNet KMs clearly outperformed the other kernel methods by a considerable margin.

Model Training Obj. Fn. Acc. (%)
LeNet-5 (ReLU) e2e XE 99.33
LeNet-5 (tanh) e2e XE 99.32
LeNet-5 (ReLU) mdlr AL/XE 99.35
LeNet-5 (ReLU) mdlr UTAL/XE 99.42
LeNet-5 (ReLU) mdlr MSE/XE 99.36
LeNet-5 (tanh) mdlr AL(-NEO)/XE 99.11 (99.19)
LeNet-5 (tanh) mdlr UTAL(-NEO)/XE 99.21 (99.11)
LeNet-5 (tanh) mdlr MSE(-NEO)/XE 99.27 (99.23)
LeNet-5 (tanh) mdlr CTS(-NEO)/XE 99.16 (99.16)
TABLE II: Accuracy on MNIST. e2e for end-to-end. mdlr for modular as two modules. Obj. Fn. specifies the loss function (and the proxy objective for the input module, if applicable) used. XE stands for cross-entropy. Definitions of the proxy objectives can be found in Sec. IV-D. Using LeNet-5 as the network backbone on MNIST, our modular approach compares favorably against end-to-end training.
Model Data Aug. Training Obj. Fn. Acc. (%)
ResNet-18 flip & crop e2e XE 94.91
ResNet-152 flip & crop e2e XE 95.87
ResNet-18 flip & crop mdlr AL/XE 94.93
ResNet-152 flip & crop mdlr AL/XE 95.73
CKN [56] none 82.18
CNTK [8] flip 81.40
CNTK [8] flip 88.36
CNN-GP [8] flip 82.20
CNN-GP [8] flip 88.92
NKWT [4] flip 89.80
TABLE III: Accuracy on CIFAR-10. Modular training yielded favorable results compared to end-to-end. To the best of our knowledge, there is no other purely modular training method that matches backpropagation with a competitive network backbone on CIFAR-10 [41, 38, 39, 43, 44]. The ResNets trained with our modular approach can be viewed as kernel machines and outperformed other existing NN-inspired kernel methods. * means the method used more sophisticated data preprocessing than the ResNets. Note that all of the baseline kernel methods have quadratic runtime and it is nontrivial to incorporate data augmentation into their pipelines.

VII-D Label Efficiency of Modular Deep Learning

Our modular training needs only implicit labels (weak supervision) for training the latent modules. Full supervision is needed only for the output module. And we argue that, in fact, the modular training of the output module requires less fully-labeled data than end-to-end training.

Intuitively, the input module in a deep architecture can be understood as learning a new representation of the given data with which the output module’s classification task is simplified. A simple way to quantify the difficulty of a learning task is through its sample complexity, which, when put in simple terms, refers to the number of labeled examples needed for a given model and a training algorithm to achieve a certain level of test accuracy [11].

In a modular training setting, one can decouple the training of the input and output modules, and it then makes sense to discuss the sample complexity of the two modules individually. The output module should require fewer labeled data examples to train since its task has been simplified once the input module has been well trained. We now observe that this is indeed the case.

Refer to caption
Figure 6: With the output module trained with only 30 randomly chosen fully-labeled examples, the modular model still achieved 94.88% accuracy on CIFAR-10 (the same model achieved 94.93% when using all 50000 labeled examples). When the training data has balanced classes (mdlr (balanced)), modular training only required 10 randomly chosen examples to achieve 94.88% accuracy — a single example per class.

We trained ResNet-18 on CIFAR-10 with end-to-end and our modular approach. The input module was trained with the full training set in the modular method, but again, this only requires implicit pairwise labels. We now compare the need for fully-labeled data between the two training methods.

With the full training set of 50000 labeled examples, the modular and end-to-end models achieved 94.93% and 94.91% test accuracy, respectively. From Fig. 6, we see that while the end-to-end model struggled in the label-scarce regime, achieving barely over 20% accuracy with 50 fully-labeled examples, the modular model consistently achieved strong performance, achieving 94.88% accuracy with 30 randomly chosen fully-labeled examples for training. In fact, if we ensure that there is at least one example from each class in the training data, the modular approach needed as few as 10 randomly chosen examples to achieve 94.88% accuracy, that is, a single randomly chosen example per class. (The fact that the modular model underperformed at 10 and 20 labels is likely because some classes were missing in the randomly selected training data: 4 classes were missing in the 10-example training data, and 2 in the 20-example data.)

These observations suggest that our modular training method for classification can rely almost entirely on implicit pairwise labels and still produce highly performant models, pointing to new paradigms for obtaining labeled data that are potentially less costly than existing ones. They also indicate that the form of strong supervision relied on by existing end-to-end training, i.e., full labels on individual data examples, is not efficient enough. In particular, we have shown that by leveraging pairwise information in the RKHS, implicit labels in the form of whether pairs of examples are from the same class are just as informative. This may have further implications for un/semi-supervised learning.
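To make the notion of implicit labels concrete, the snippet below derives the pairwise supervision for a batch from its class labels; this matrix of same-class indicators is, per batch, all the hidden modules ever see. This is a sketch of the idea, not our exact implementation.

```python
import torch

y = torch.tensor([0, 3, 0, 7])                 # per-example class labels (never
                                               # seen by the hidden modules)
pairwise = (y.unsqueeze(0) == y.unsqueeze(1))  # (batch, batch) same-class matrix
print(pairwise.int())
# tensor([[1, 0, 1, 0],
#         [0, 1, 0, 0],
#         [1, 0, 1, 0],
#         [0, 0, 0, 1]])
```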

VII-D1 Connections With Un/Semi-Supervised Learning

Whenever the training batch size is smaller than the size of the entire training set, we are using strictly less supervision than full labels provide.⁵ The fact that our learning method still produces equally performant models suggests that we are using supervision more efficiently.

⁵ This is because, if one does not cache information across batches, it is impossible to recover the labels of the entire training set when given only pairwise relationships on each batch.

Existing semi-supervised methods also aim to utilize supervision more efficiently. They still rely on end-to-end training and use full labels, but achieve enhanced efficiency mostly through data augmentation. Our results are not directly comparable to those from existing methods (these methods do not work in the same setting as ours: they use fewer full labels, while we use mostly implicit labels), but as a reference point, the state-of-the-art semi-supervised algorithm achieves 88.61% accuracy on CIFAR-10 with 40 labels, using a stronger backbone network (Wide ResNet-28-2) and a significantly more complicated training pipeline [57].

Therefore, we believe the insights gained from our modular learning suggest that (1) the current form of supervision, i.e., full labels on individual data examples, although pervasively used by supervised and semi-supervised learning methods, is far from efficient, and (2) by leveraging novel forms of labels and learning methods, simpler yet stronger un/semi-supervised learning algorithms exist. We leave this as future work.

VII-E Transferability Estimation With Proxy Objective

To empirically verify the effectiveness of our transferability estimation method, we created a set of tasks by selecting 6 classes from CIFAR-10 and grouping each pair into a single binary classification task. The selected classes are: cat, automobile, dog, horse, truck, and deer. Each new task is named by concatenating the first two letters of the classes involved, e.g., cado refers to the task of classifying dogs from cats.
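For concreteness, the 15 tasks can be enumerated as below; the within-name ordering of the two classes shown here is illustrative.

```python
from itertools import combinations

# Pair 6 CIFAR-10 classes into C(6, 2) = 15 binary tasks, each named by
# the first two letters of the classes involved.
classes = ["cat", "automobile", "dog", "horse", "truck", "deer"]
tasks = ["".join(c[:2] for c in pair) for pair in combinations(classes, 2)]
print(len(tasks))  # 15
print(tasks[:3])   # ['caau', 'cado', 'caho']
```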

We trained one ResNet-18 with the proposed modular method for each source task, using AL as the proxy objective. For each model, the output linear layer plus the nonlinearity preceding it constitutes the output module, and everything else belongs to the input module. The networks achieved an average of 98.7% test accuracy on the tasks they were trained on,⁶ suggesting that each frozen input module possesses the information essential to its source task. Therefore, quantifying input module reusability is essentially the same as describing the transferability between each pair of tasks.

⁶ The highest accuracy was 99.95%, achieved on audo; the lowest was 93.05%, on cado.

The true transferability from a source task to a target task can be quantified by the test performance of the optimal output module trained for the target task on top of a frozen input module from the source task [10]. Our estimation of transferability is based purely on the proxy objective: a frozen input module achieving a higher proxy objective value on a target task indicates that the source and target are more transferable. This estimation is performed using a randomly selected subset of the target task's training data.
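As a sketch, the proxy objective can be instantiated as kernel-target alignment [21] computed on the frozen input module's representations of a target-task subset; the Gaussian kernel and the width sigma below are illustrative assumptions rather than the exact choices in our pipeline.

```python
import torch

def alignment_score(input_module, x, y, sigma=1.0):
    """Estimate transferability of a frozen source `input_module` to the
    target task given by (x, y), y in {-1, +1}, e.g., a random 10% subset."""
    with torch.no_grad():
        f = input_module(x)                       # target data in source features
    K = torch.exp(-torch.cdist(f, f) ** 2 / (2 * sigma ** 2))  # kernel matrix
    Y = torch.outer(y.float(), y.float())         # ideal kernel: +1 same class, -1 otherwise
    return (K * Y).sum() / (K.norm() * Y.norm())  # kernel-target alignment

# Source modules can then be ranked for a given target task by this score:
# higher alignment indicates higher estimated transferability.
```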

In Fig. 7, we visualize the task space structure with trdo as the source and as the target, respectively. The figures show that, despite its simplicity, our method accurately described the task space structure using only 10% of the target task training data. The discovered transferability relationships also align well with common sense. For example, trdo is more transferable to, e.g., detr, since both contain the truck class. trdo also transfers well with auca because features useful for distinguishing trucks from dogs are likely also helpful for classifying automobiles from cats.

Figure 7: Each task is assigned a distinct color and an angle in polar coordinates. Transferability with respect to trdo is illustrated by the distance from the origin, with a smaller distance indicating higher transferability. At practically no computational overhead, our proposed method correctly described the task space structure using only 10% randomly selected training data from the target task. Plots with the other 14 tasks as the source and target are provided in the Appendix.

VIII Conclusions

In this paper, we proposed a simple, alternative view on NNs that turns layers into linear models in feature spaces and showed that these are KMs. Based on this construction, we presented a modular learning framework for classification that does not require between-module backpropagation. Focusing on the two-module instantiation, we proved its optimality, demonstrated that it matches state-of-the-art performance from end-to-end backpropagation on MNIST and CIFAR-10, and showed that it learns powerful classifiers using almost only implicit, more efficient pairwise labels. This modular learning framework enables fully modularized DL workflows. We then demonstrated the benefit of such a workflow in a transfer learning setting, where the transferability among 15 binary classification tasks from CIFAR-10 was to be estimated. Our simple approach accurately described the task space structure using only a fraction of the target task training data at practically no computation overhead.

Future work includes a rigorous study of the transferability estimation problem and our proposed solution; both theoretical and empirical results could enhance the understanding of our method. A more in-depth study of how our modular learning method can help produce better un/semi-supervised learning paradigms is also of interest. Finally, a study of how the proposed proxy objectives behave in practice, and of whether and how network design choices affect our modular learning, would be particularly beneficial for practitioners.

IX Acknowledgements

This work was supported by DARPA (FA9453-18-1-0039) and ONR (N00014-18-1-2306).

References

  • [1] R. M. Neal, “Bayesian learning for neural networks,” Ph.D. dissertation, University of Toronto, 1995.
  • [2] J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein, “Deep neural networks as Gaussian processes,” arXiv preprint arXiv:1711.00165, 2017.
  • [3] A. Jacot, F. Gabriel, and C. Hongler, “Neural tangent kernel: Convergence and generalization in neural networks,” in Advances in neural information processing systems, 2018, pp. 8571–8580.
  • [4] V. Shankar, A. Fang, W. Guo, S. Fridovich-Keil, L. Schmidt, J. Ragan-Kelley, and B. Recht, “Neural kernels without tangents,” arXiv preprint arXiv:2003.02237, 2020.
  • [5] Y. Cho and L. K. Saul, “Kernel methods for deep learning,” in Advances in neural information processing systems, 2009, pp. 342–350.
  • [6] S. Arora, S. S. Du, Z. Li, R. Salakhutdinov, R. Wang, and D. Yu, “Harnessing the power of infinitely wide deep nets on small-data tasks,” arXiv preprint arXiv:1910.01663, 2019.
  • [7] S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov, and R. Wang, “On exact computation with an infinitely wide neural net,” in Advances in Neural Information Processing Systems, 2019, pp. 8139–8148.
  • [8] Z. Li, R. Wang, D. Yu, S. S. Du, W. Hu, R. Salakhutdinov, and S. Arora, “Enhanced convolutional neural tangent kernels,” arXiv preprint arXiv:1911.00809, 2019.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [10] A. T. Tran, C. V. Nguyen, and T. Hassner, “Transferability and hardness of supervised classification tasks,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1395–1405.
  • [11] S. Shalev-Shwartz and S. Ben-David, Understanding machine learning: From theory to algorithms.   Cambridge university press, 2014.
  • [12] N. Aronszajn, “Theory of reproducing kernels,” Transactions of the American mathematical society, vol. 68, no. 3, pp. 337–404, 1950.
  • [13] J. H. Manton, P.-O. Amblard et al., “A primer on reproducing kernel Hilbert spaces,” Foundations and Trends® in Signal Processing, vol. 8, no. 1–2, pp. 1–126, 2015.
  • [14] B. Scholkopf and A. J. Smola, Learning with kernels: support vector machines, regularization, optimization, and beyond.   MIT press, 2001.
  • [15] A. Rahimi and B. Recht, “Random features for large-scale kernel machines,” in Advances in neural information processing systems, 2008, pp. 1177–1184.
  • [16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [17] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 807–814.
  • [18] M. Gönen and E. Alpaydın, “Multiple kernel learning algorithms,” Journal of machine learning research, vol. 12, no. Jul, pp. 2211–2268, 2011.
  • [19] V. Vapnik, “The nature of statistical learning theory,” 2000.
  • [20] J. Shawe-Taylor, N. Cristianini et al., Kernel methods for pattern analysis.   Cambridge university press, 2004.
  • [21] N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. S. Kandola, “On kernel-target alignment,” in Advances in neural information processing systems, 2002, pp. 367–373.
  • [22] A. R. Zamir, A. Sax, W. Shen, L. J. Guibas, J. Malik, and S. Savarese, “Taskonomy: Disentangling task transfer learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3712–3722.
  • [23] A. Achille, M. Lam, R. Tewari, A. Ravichandran, S. Maji, C. C. Fowlkes, S. Soatto, and P. Perona, “Task2vec: Task embedding for meta-learning,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6430–6439.
  • [24] C. V. Nguyen, T. Hassner, C. Archambeau, and M. Seeger, “Leep: A new measure to evaluate transferability of learned representations,” arXiv preprint arXiv:2002.12462, 2020.
  • [25] P. L. Bartlett and S. Mendelson, “Rademacher and Gaussian complexities: Risk bounds and structural results,” Journal of Machine Learning Research, vol. 3, no. Nov, pp. 463–482, 2002.
  • [26] J. A. Suykens and J. Vandewalle, “Training multilayer perceptron classifiers based on a modified support vector method,” IEEE transactions on Neural Networks, vol. 10, no. 4, pp. 907–911, 1999.
  • [27] V. Vapnik and A. Y. Chervonenkis, “On the uniform convergence of relative frequencies of events to their probabilities,” Theory of Probability & Its Applications, vol. 16, no. 2, pp. 264–280, 1971.
  • [28] E. R. Kandel, J. H. Schwartz, T. M. Jessell, S. Siegelbaum, and A. Hudspeth, Principles of neural science.   McGraw-Hill New York, 2000, vol. 4.
  • [29] T. Hrycej, Modular learning in neural networks: a modularized approach to neural network classification.   John Wiley & Sons, Inc., 1992.
  • [30] D. Meunier, R. Lambiotte, and E. T. Bullmore, “Modular and hierarchically modular organization of brain networks,” Frontiers in neuroscience, vol. 4, p. 200, 2010.
  • [31] J. Clune, J.-B. Mouret, and H. Lipson, “The evolutionary origins of modularity,” Proceedings of the Royal Society b: Biological sciences, vol. 280, no. 1755, p. 20122863, 2013.
  • [32] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural computation, vol. 3, no. 1, pp. 79–87, 1991.
  • [33] M. I. Jordan and R. A. Jacobs, “Hierarchical mixtures of experts and the em algorithm,” Neural computation, vol. 6, no. 2, pp. 181–214, 1994.
  • [34] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
  • [35] C. Watanabe, K. Hiramatsu, and K. Kashino, “Modular representation of layered neural networks,” Neural Networks, vol. 97, pp. 62–73, 2018.
  • [36] S. E. Fahlman and C. Lebiere, “The cascade-correlation learning architecture,” in Advances in neural information processing systems, 1990, pp. 524–532.
  • [37] Y. Bengio, “How auto-encoders could provide credit assignment in deep networks via target propagation,” arXiv preprint arXiv:1407.7906, 2014.
  • [38] D.-H. Lee, S. Zhang, A. Fischer, and Y. Bengio, “Difference target propagation,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases.   Springer, 2015, pp. 498–515.
  • [39] M. Carreira-Perpinan and W. Wang, “Distributed optimization of deeply nested systems,” in Artificial Intelligence and Statistics, 2014, pp. 10–19.
  • [40] D. Balduzzi, H. Vanchinathan, and J. Buhmann, “Kickback cuts backprop’s red-tape: biologically plausible credit assignment in neural networks,” in Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
  • [41] M. Jaderberg, W. M. Czarnecki, S. Osindero, O. Vinyals, A. Graves, D. Silver, and K. Kavukcuoglu, “Decoupled neural interfaces using synthetic gradients,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70.   JMLR.org, 2017, pp. 1627–1635.
  • [42] Z.-H. Zhou and J. Feng, “Deep forest,” arXiv preprint arXiv:1702.08835, 2017.
  • [43] S. Löwe, P. O’Connor, and B. Veeling, “Putting an end to end-to-end: Gradient-isolated learning of representations,” in Advances in Neural Information Processing Systems, 2019, pp. 3033–3045.
  • [44] S. Duan, S. Yu, Y. Chen, and J. C. Principe, “On kernel method–based connectionist models and supervised deep learning without backpropagation,” Neural computation, pp. 1–39, 2019.
  • [45] S. Ben-David and R. Schuller, “Exploiting task relatedness for multiple task learning,” in Learning Theory and Kernel Machines.   Springer, 2003, pp. 567–580.
  • [46] S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira, “Analysis of representations for domain adaptation,” in Advances in neural information processing systems, 2007, pp. 137–144.
  • [47] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016.
  • [48] Y. Li, D. E. Carlson et al., “Extracting relationships by multi-domain matching,” in Advances in Neural Information Processing Systems, 2018, pp. 6798–6809.
  • [49] P. Liu, X. Qiu, and X. Huang, “Adversarial multi-task learning for text classification,” arXiv preprint arXiv:1704.05742, 2017.
  • [50] H. Janati, M. Cuturi, and A. Gramfort, “Wasserstein regularization for sparse multi-task regression,” arXiv preprint arXiv:1805.07833, 2018.
  • [51] Y. Bao, Y. Li, S.-L. Huang, L. Zhang, L. Zheng, A. Zamir, and L. Guibas, “An information-theoretic approach to transferability in task transfer learning,” in 2019 IEEE International Conference on Image Processing (ICIP).   IEEE, 2019, pp. 2309–2313.
  • [52] S. Yu, A. Shaker, F. Alesiani, and J. C. Principe, “Measuring the discrepancy between conditional distributions: Methods, properties and applications,” 2020.
  • [53] W. Liu, P. P. Pokharel, and J. C. Príncipe, “Correntropy: Properties and applications in non-Gaussian signal processing,” IEEE Transactions on Signal Processing, vol. 55, no. 11, pp. 5286–5298, 2007.
  • [54] B. Kulis, M. A. Sustik, and I. S. Dhillon, “Low-rank kernel learning with bregman matrix divergences,” Journal of Machine Learning Research, vol. 10, no. Feb, pp. 341–376, 2009.
  • [55] S. Arora, H. Khandeparkar, M. Khodak, O. Plevrakis, and N. Saunshi, “A theoretical analysis of contrastive unsupervised representation learning,” arXiv preprint arXiv:1902.09229, 2019.
  • [56] J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid, “Convolutional kernel networks,” in Advances in neural information processing systems, 2014, pp. 2627–2635.
  • [57] K. Sohn, D. Berthelot, C.-L. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel, “Fixmatch: Simplifying semi-supervised learning with consistency and confidence,” arXiv preprint arXiv:2001.07685, 2020.

1. Proving Theorem IV.1

Lemma .1.

Given an inner product space $H$, a unit vector $\mathbf{e}\in H$, and four other vectors $\mathbf{v}_{+},\mathbf{v}_{-},\mathbf{v}_{+}^{\star},\mathbf{v}_{-}^{\star}$ with $\|\mathbf{v}_{+}\|_{H}=\|\mathbf{v}_{-}\|_{H}=\|\mathbf{v}_{+}^{\star}\|_{H}=\|\mathbf{v}_{-}^{\star}\|_{H}>0$, where the norm is the canonical norm induced by the inner product, i.e., $\|\mathbf{u}\|_{H}^{2}:=\langle\mathbf{u},\mathbf{u}\rangle_{H}$ for all $\mathbf{u}\in H$, assume

$$\|\mathbf{v}_{+}-\mathbf{v}_{-}\|_{H}\leq\|\mathbf{v}_{+}^{\star}-\mathbf{v}_{-}^{\star}\|_{H}. \tag{24}$$

Then there exists a unit vector $\mathbf{e}^{\star}\in H$ such that

$$\langle\mathbf{e},\mathbf{v}_{+}\rangle_{H}\leq\langle\mathbf{e}^{\star},\mathbf{v}_{+}^{\star}\rangle_{H}; \tag{25}$$
$$\langle\mathbf{e},\mathbf{v}_{-}\rangle_{H}\geq\langle\mathbf{e}^{\star},\mathbf{v}_{-}^{\star}\rangle_{H}. \tag{26}$$
Proof.

Throughout, we omit the subscript $H$ on inner products and norms for brevity; this causes no ambiguity since no other inner product or norm is involved in this proof.

If $\mathbf{v}_{+}^{\star}=\mathbf{v}_{-}^{\star}$, then $\|\mathbf{v}_{+}^{\star}-\mathbf{v}_{-}^{\star}\|=0$, which by Eq. 24 would imply

$$\|\mathbf{v}_{+}-\mathbf{v}_{-}\|=0\Longleftrightarrow\mathbf{v}_{+}=\mathbf{v}_{-}. \tag{27}$$

Then we may choose $\mathbf{e}^{\star}$ such that $\langle\mathbf{e}^{\star},\mathbf{v}_{+}^{\star}\rangle=\langle\mathbf{e},\mathbf{v}_{+}\rangle$ and the result holds trivially.

On the other hand, if $\mathbf{v}_{+}^{\star}=-\mathbf{v}_{-}^{\star}$, then $\mathbf{e}^{\star}=\mathbf{v}_{+}^{\star}/\|\mathbf{v}_{+}^{\star}\|$ is a valid choice of $\mathbf{e}^{\star}$. Indeed, by Cauchy-Schwarz,

$$\langle\mathbf{e}^{\star},\mathbf{v}_{+}^{\star}\rangle=\|\mathbf{v}_{+}^{\star}\|\geq\langle\mathbf{e},\mathbf{v}_{+}\rangle; \tag{28}$$
$$\langle\mathbf{e}^{\star},\mathbf{v}_{-}^{\star}\rangle=-\|\mathbf{v}_{-}^{\star}\|\leq\langle\mathbf{e},\mathbf{v}_{-}\rangle. \tag{29}$$

Therefore, we may assume that $\mathbf{v}_{+}^{\star}\neq\pm\mathbf{v}_{-}^{\star}$.

For two vectors $\mathbf{a},\mathbf{b}$, we define the “angle” between them, denoted $\theta\in[0,\pi]$, via $\cos\theta:=\langle\mathbf{a},\mathbf{b}\rangle/(\|\mathbf{a}\|\|\mathbf{b}\|)$. This angle is well-defined since $\cos$ is injective on $[0,\pi]$.

Since

$$\langle\mathbf{a},\mathbf{b}\rangle=-\frac{1}{2}\|\mathbf{a}-\mathbf{b}\|^{2}+\frac{1}{2}\|\mathbf{a}\|^{2}+\frac{1}{2}\|\mathbf{b}\|^{2}\quad\forall\,\mathbf{a},\mathbf{b}, \tag{30}$$

$\|\mathbf{v}_{+}\|=\|\mathbf{v}_{-}\|=\|\mathbf{v}_{+}^{\star}\|=\|\mathbf{v}_{-}^{\star}\|>0$ and $\|\mathbf{v}_{+}-\mathbf{v}_{-}\|\leq\|\mathbf{v}_{+}^{\star}-\mathbf{v}_{-}^{\star}\|$ together imply $\langle\mathbf{v}_{+},\mathbf{v}_{-}\rangle/(\|\mathbf{v}_{+}\|\|\mathbf{v}_{-}\|)\geq\langle\mathbf{v}_{+}^{\star},\mathbf{v}_{-}^{\star}\rangle/(\|\mathbf{v}_{+}^{\star}\|\|\mathbf{v}_{-}^{\star}\|)$, which then implies $\theta\leq\theta^{\star}$ since $\cos$ is strictly decreasing on $[0,\pi]$, where $\theta$ is the angle between $\mathbf{v}_{+},\mathbf{v}_{-}$ and $\theta^{\star}$ the angle between $\mathbf{v}_{+}^{\star},\mathbf{v}_{-}^{\star}$.

Let $\gamma_{+}$ be the angle between $\mathbf{e},\mathbf{v}_{+}$ and $\gamma_{-}$ that between $\mathbf{e},\mathbf{v}_{-}$. Note that $\gamma_{-}-\gamma_{+}\in[0,\pi]$. Define

$$p:=\langle\mathbf{e},\mathbf{v}_{+}\rangle=\|\mathbf{v}_{+}\|\cos\gamma_{+}; \tag{31}$$
$$n:=\langle\mathbf{e},\mathbf{v}_{-}\rangle=\|\mathbf{v}_{-}\|\cos\gamma_{-}. \tag{32}$$

Now, suppose that we have shown

$$\gamma_{-}-\gamma_{+}\leq\theta; \tag{33}$$
$$\exists\,\mathbf{e}^{\star}\text{ s.t. }\gamma_{-}^{\star}-\gamma_{+}^{\star}=\theta^{\star}\text{ and }\gamma_{-}^{\star}=\gamma_{-}, \tag{34}$$

where $\gamma_{+}^{\star}$ is the angle between $\mathbf{e}^{\star},\mathbf{v}_{+}^{\star}$ and $\gamma_{-}^{\star}$ that between $\mathbf{e}^{\star},\mathbf{v}_{-}^{\star}$. Define

$$p^{\star}:=\langle\mathbf{e}^{\star},\mathbf{v}_{+}^{\star}\rangle=\|\mathbf{v}_{+}^{\star}\|\cos\gamma_{+}^{\star}; \tag{35}$$
$$n^{\star}:=\langle\mathbf{e}^{\star},\mathbf{v}_{-}^{\star}\rangle=\|\mathbf{v}_{-}^{\star}\|\cos\gamma_{-}^{\star}. \tag{36}$$

Then, using $\|\mathbf{v}_{-}\|=\|\mathbf{v}_{-}^{\star}\|>0$ and the earlier result that $\theta\leq\theta^{\star}$ (so that $\gamma_{+}^{\star}=\gamma_{-}-\theta^{\star}\leq\gamma_{-}-(\gamma_{-}-\gamma_{+})=\gamma_{+}$), we would have

$$n=n^{\star}; \tag{37}$$
$$\gamma_{+}\geq\gamma_{+}^{\star}, \tag{38}$$

which, together with the fact that $\cos$ is strictly decreasing on $[0,\pi]$ and the assumption that $\|\mathbf{v}_{+}\|=\|\mathbf{v}_{+}^{\star}\|>0$, implies

$$n=n^{\star}; \tag{39}$$
$$p\leq p^{\star}, \tag{40}$$

proving the result.

To prove Eq. 33, it suffices to show $\cos(\gamma_{-}-\gamma_{+})\geq\cos\theta$ since $\cos$ is decreasing on $[0,\pi]$, to which both $\gamma_{-}-\gamma_{+}$ and $\theta$ belong. To this end, we have

$$\cos(\gamma_{-}-\gamma_{+})=\cos\gamma_{-}\cos\gamma_{+}+\sin\gamma_{-}\sin\gamma_{+} \tag{41}$$
$$=\frac{pn}{\|\mathbf{v}_{+}\|\|\mathbf{v}_{-}\|}+\sqrt{\frac{(\|\mathbf{v}_{+}\|^{2}-p^{2})(\|\mathbf{v}_{-}\|^{2}-n^{2})}{\|\mathbf{v}_{+}\|^{2}\|\mathbf{v}_{-}\|^{2}}}. \tag{42}$$

Since $\|\mathbf{v}_{+}\|^{2}-p^{2}=\|\mathbf{v}_{+}\|^{2}+p^{2}-2p^{2}=\|\mathbf{v}_{+}\|^{2}+p^{2}-2p\langle\mathbf{e},\mathbf{v}_{+}\rangle=\|\mathbf{v}_{+}-p\mathbf{e}\|^{2}$ and similarly $\|\mathbf{v}_{-}\|^{2}-n^{2}=\|\mathbf{v}_{-}-n\mathbf{e}\|^{2}$,

$$\cos(\gamma_{-}-\gamma_{+})=\frac{pn+\|\mathbf{v}_{+}-p\mathbf{e}\|\|\mathbf{v}_{-}-n\mathbf{e}\|}{\|\mathbf{v}_{+}\|\|\mathbf{v}_{-}\|} \tag{43}$$
$$\geq\frac{pn+\langle\mathbf{v}_{+}-p\mathbf{e},\mathbf{v}_{-}-n\mathbf{e}\rangle}{\|\mathbf{v}_{+}\|\|\mathbf{v}_{-}\|} \tag{44}$$
$$=\frac{\langle\mathbf{v}_{+},\mathbf{v}_{-}\rangle}{\|\mathbf{v}_{+}\|\|\mathbf{v}_{-}\|} \tag{45}$$
$$=\cos\theta, \tag{46}$$

where the inequality is due to Cauchy-Schwarz.
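Eq. 33 is an instance of the angle triangle inequality on the unit sphere; as a quick numeric sanity check (an illustration only, not part of the proof), one may verify it on random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
unit = lambda v: v / np.linalg.norm(v)

for _ in range(10_000):
    e, vp, vm = (unit(rng.normal(size=8)) for _ in range(3))
    gp = np.arccos(np.clip(e @ vp, -1, 1))   # gamma_+: angle between e and v_+
    gm = np.arccos(np.clip(e @ vm, -1, 1))   # gamma_-: angle between e and v_-
    th = np.arccos(np.clip(vp @ vm, -1, 1))  # theta: angle between v_+ and v_-
    assert np.cos(gm - gp) >= np.cos(th) - 1e-9
```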

To prove Eq. 34, it suffices to show that there exists $\mathbf{e}^{\star}$ such that

  (A) one of $\mathbf{v}_{+}^{\star}-p^{\star}\mathbf{e}^{\star}$, $\mathbf{v}_{-}^{\star}-n^{\star}\mathbf{e}^{\star}$ is a scalar multiple of the other;

  (B) $\cos\gamma_{-}^{\star}=\cos\gamma_{-}$.

Indeed, A implies $\|\mathbf{v}_{+}^{\star}-p^{\star}\mathbf{e}^{\star}\|\|\mathbf{v}_{-}^{\star}-n^{\star}\mathbf{e}^{\star}\|=\langle\mathbf{v}_{+}^{\star}-p^{\star}\mathbf{e}^{\star},\mathbf{v}_{-}^{\star}-n^{\star}\mathbf{e}^{\star}\rangle$, which, together with arguments similar to those used to prove Eq. 33, implies $\gamma_{-}^{\star}-\gamma_{+}^{\star}=\theta^{\star}$. Also note that B is equivalent to $n=n^{\star}$.

We prove existence constructively. Specifically, we set $\mathbf{e}^{\star}=a\mathbf{v}_{+}^{\star}+b\mathbf{v}_{-}^{\star}$ with $a,b\in\mathbb{R}$ and find $a,b$ such that $\mathbf{e}^{\star}$ satisfies A and B simultaneously.

Let $s=\langle\mathbf{v}_{+}^{\star},\mathbf{v}_{-}^{\star}\rangle$ and $r=\|\mathbf{v}_{+}^{\star}\|^{2}=\|\mathbf{v}_{-}^{\star}\|^{2}$. Note that we immediately have

$$r>0;\quad|s|\leq r. \tag{47}$$

The assumption $\mathbf{v}_{+}^{\star}\neq\pm\mathbf{v}_{-}^{\star}$ implies $|s|<r$.

Since $\mathbf{e}^{\star}$ is a unit vector, we have the following constraint on $a,b$:

$$a^{2}r+b^{2}r+2abs=1. \tag{48}$$

And some simple algebra yields

$$p^{\star}=ar+bs; \tag{49}$$
$$n^{\star}=as+br. \tag{50}$$

To identify those $a,b$ such that A holds, we first rewrite $\mathbf{v}_{+}^{\star}-p^{\star}\mathbf{e}^{\star}$ and $\mathbf{v}_{-}^{\star}-n^{\star}\mathbf{e}^{\star}$ as follows:

$$\mathbf{v}_{+}^{\star}-p^{\star}\mathbf{e}^{\star}=\mathbf{v}_{+}^{\star}-(ar+bs)(a\mathbf{v}_{+}^{\star}+b\mathbf{v}_{-}^{\star}) \tag{51}$$
$$=(1-a^{2}r-abs)\mathbf{v}_{+}^{\star}+(-abr-b^{2}s)\mathbf{v}_{-}^{\star}. \tag{52}$$

Similarly,

$$\mathbf{v}_{-}^{\star}-n^{\star}\mathbf{e}^{\star}=(-a^{2}s-abr)\mathbf{v}_{+}^{\star}+(1-abs-b^{2}r)\mathbf{v}_{-}^{\star}. \tag{53}$$

Define

$$w_{1,+}:=1-a^{2}r-abs; \tag{54}$$
$$w_{1,-}:=-abr-b^{2}s; \tag{55}$$
$$w_{2,+}:=-a^{2}s-abr; \tag{56}$$
$$w_{2,-}:=1-abs-b^{2}r. \tag{57}$$

Then we have

$$\mathbf{v}_{+}^{\star}-p^{\star}\mathbf{e}^{\star}=w_{1,+}\mathbf{v}_{+}^{\star}+w_{1,-}\mathbf{v}_{-}^{\star}; \tag{58}$$
$$\mathbf{v}_{-}^{\star}-n^{\star}\mathbf{e}^{\star}=w_{2,+}\mathbf{v}_{+}^{\star}+w_{2,-}\mathbf{v}_{-}^{\star}. \tag{59}$$

Assuming none of $w_{1,+},w_{1,-},w_{2,+},w_{2,-}$ is 0, A is equivalent to

$$w_{1,+}w_{2,-}=w_{1,-}w_{2,+}. \tag{60}$$

To check that this is always true, we have

$$w_{1,+}w_{2,-}=a^{3}brs+ab^{3}rs+a^{2}b^{2}s^{2}+a^{2}b^{2}r^{2}-2abs-r(a^{2}+b^{2})+1 \tag{62}$$
$$=a^{3}brs+ab^{3}rs+a^{2}b^{2}s^{2}+a^{2}b^{2}r^{2} \tag{63}$$

because of Eq. 48. And

$$w_{1,-}w_{2,+}=a^{3}brs+ab^{3}rs+a^{2}b^{2}s^{2}+a^{2}b^{2}r^{2}. \tag{64}$$

Therefore, A is always true.

If at least one of $w_{1,+},w_{1,-},w_{2,+},w_{2,-}$ is 0, we have the following mutually exclusive cases:

  (i) one of the four coefficients is 0 while the other three are not (✗);

  (ii) $w_{1,+}=w_{1,-}=0$ and the others are not 0 (✓);

  (iii) $w_{2,+}=w_{2,-}=0$ and the others are not 0 (✓);

  (iv) $w_{1,+}=w_{2,-}=0$ and the others are not 0 (✗);

  (v) $w_{1,-}=w_{2,+}=0$ and the others are not 0 (✗);

  (vi) $w_{1,+}=w_{2,+}=0$ and the others are not 0 (✓);

  (vii) $w_{1,-}=w_{2,-}=0$ and the others are not 0 (✓);

  (viii) three of the four coefficients are 0 while one is not (✓);

  (ix) $w_{1,+}=w_{1,-}=w_{2,+}=w_{2,-}=0$ (✓),

where the cases marked with ✗ are those in which A cannot be true (so our choice of $a,b$ must not fall into them) and the cases marked with ✓ are those in which A is true. Note that A cannot be true in cases iv and v because if A were true, it would imply that $\mathbf{v}_{+}^{\star}$ and $\mathbf{v}_{-}^{\star}$ are linearly dependent. However, since $\|\mathbf{v}_{+}^{\star}\|=\|\mathbf{v}_{-}^{\star}\|$, we would then have either $\mathbf{v}_{+}^{\star}=\mathbf{v}_{-}^{\star}$ or $\mathbf{v}_{+}^{\star}=-\mathbf{v}_{-}^{\star}$, both of which were excluded at the beginning of this proof.

Therefore, A is satisfied by any $a,b$ that satisfy Eq. 48 but fall into none of cases i, iv, and v.

We now turn to the search for $a,b$ such that B holds. B is equivalent to

$$as+br=n\Longleftrightarrow b=\frac{1}{r}(n-as). \tag{65}$$

Therefore, finding $a,b$ such that A and B hold simultaneously amounts to finding $a,b$ such that

$$b=\frac{1}{r}(n-as); \tag{66}$$
$$a^{2}r+b^{2}r+2abs=1; \tag{67}$$
$$\text{none of cases i, iv, v holds.} \tag{68}$$

Now, substituting $b=(n-as)/r$ into Eq. 67 and solving for $a$, we have

$$a^{2}=\frac{r-n^{2}}{r^{2}-s^{2}}, \tag{69}$$

and we choose

$$a=\sqrt{\frac{r-n^{2}}{r^{2}-s^{2}}}. \tag{70}$$

This root is real since $n=\langle\mathbf{e},\mathbf{v}_{-}\rangle$ and therefore $n^{2}=r\cos^{2}\gamma_{-}\leq r$.

To verify that

$$a=\sqrt{\frac{r-n^{2}}{r^{2}-s^{2}}},\quad b=\frac{1}{r}(n-as) \tag{71}$$

satisfies condition 68, first note that since this solution of $a,b$ comes from solving $n=n^{\star}$ and Eq. 67, we have

$$w_{1,+}=1-a(bs+ar)=b(br+as)=bn; \tag{72}$$
$$w_{1,-}\propto b(bs+ar); \tag{73}$$
$$w_{2,+}\propto a(as+br)=an; \tag{74}$$
$$w_{2,-}=1-b(as+br)=1-bn. \tag{75}$$

We now analyze each of cases i, iv, and v individually and show that our choice of $a,b$ does not fall into any of them, proving that this particular $a,b$ satisfies Eq. 66, Eq. 67, and condition 68.

Case i:

If $w_{1,+}=0$, then either $b=0$ or $n=0$, which would give $w_{1,-}\propto b(bs+ar)=0$ or $w_{2,+}\propto an=0$, respectively; i.e., at least 2 coefficients must be 0.

If $w_{1,-}=0$, then either $b=0$, in which case $w_{1,+}=0$, or $bs+ar=0$. In the latter case, Eq. 67 gives

$$abs+a^{2}r=0\Longrightarrow b^{2}r+abs=1\Longrightarrow b(br+as)=bn=1\Longrightarrow w_{2,-}=1-bn=0. \tag{76}$$

Again, at least 2 coefficients would be 0.

If $w_{2,+}=0$, then either $n=0$, which would result in $w_{1,+}=bn=0$, or $a=0$. Assuming $a=0$, Eq. 71 would imply $n=\pm\sqrt{r}$ and $b=n/r$. Either way, we would have $b=1/n$ and therefore $w_{2,-}=1-bn=0$.

Finally, if $w_{2,-}=0$, then first assuming $n\neq 0$, we would have $b=1/n$. Then Eq. 71 would give $1/n=(n-as)/r$. Solving for $a$, we have $a=(n^{2}-r)/(ns)$. $r\geq n^{2}$ would imply $a\leq 0$; on the other hand, Eq. 71 also implies $a\geq 0$. Hence $a$ must be 0, in which case $w_{2,+}\propto an=0$. If $n=0$, we would have $w_{2,+}=0$ as well.

In every subcase, at least 2 coefficients vanish, so our $a,b$ cannot fall into case i.

Case iv:

This case is impossible, since $w_{1,+}+w_{2,-}=1$ means the two cannot both be 0.

Case v:

If $w_{1,-}=w_{2,+}=0$, then, as shown in the analysis of case i, either $w_{1,+}$ or $w_{2,-}$ must also be 0; that is, at least 3 coefficients would be 0.

In summary, we have found an $\mathbf{e}^{\star}$ such that A and B hold simultaneously, proving Eq. 34. ∎


We now prove Theorem IV.1.

Proof.

The result amounts to proving

$$L(f_{2}^{\star}\circ F_{1}^{\star},S)\leq L(f_{2}\circ F_{1},S),\quad\forall f_{2},F_{1}. \tag{77}$$

Define $S_{+}$ to be the set of all $\mathbf{x}_{i}$ such that $i\in I_{+}$ and $S_{-}$ the set of all $\mathbf{x}_{j}$ such that $j\in I_{-}$. Letting $\kappa=\frac{1}{n}\sum_{i=1}^{n}\mathbbm{1}_{\{i\in I_{+}\}}$, we have

$$L(f_{2}\circ F_{1}^{\star},S)\leq\kappa\,\ell_{+}\!\left(f_{2}\circ F_{1}^{\star}(\mathbf{x}_{+}^{\star})\right)+(1-\kappa)\,\ell_{-}\!\left(f_{2}\circ F_{1}^{\star}(\mathbf{x}_{-}^{\star})\right)+\lambda g(\|\mathbf{w}\|) \tag{78}$$
$$=\kappa\,\ell_{+}\!\left(\left\langle\tfrac{\mathbf{w}}{\|\mathbf{w}\|},\Phi_{+}^{\star}\right\rangle\|\mathbf{w}\|+b\right)+(1-\kappa)\,\ell_{-}\!\left(\left\langle\tfrac{\mathbf{w}}{\|\mathbf{w}\|},\Phi_{-}^{\star}\right\rangle\|\mathbf{w}\|+b\right)+\lambda g(\|\mathbf{w}\|) \tag{79}$$

for some $\mathbf{x}_{+}^{\star},\mathbf{x}_{-}^{\star}$ from $S_{+},S_{-}$, respectively, where $\Phi_{+}^{\star}:=\Phi\left(F_{1}^{\star}(\mathbf{x}_{+}^{\star})\right)$ and $\Phi_{-}^{\star}:=\Phi\left(F_{1}^{\star}(\mathbf{x}_{-}^{\star})\right)$.

For any $f_{2}^{\prime}\circ F_{1}^{\prime}$, let $f_{2}^{\prime}$ be parameterized by $\mathbf{w}^{\prime},b^{\prime}$. We have

$$L(f_{2}^{\prime}\circ F_{1}^{\prime},S)\geq\kappa\,\ell_{+}\!\left(f_{2}^{\prime}\circ F_{1}^{\prime}(\mathbf{x}_{+}^{\prime})\right)+(1-\kappa)\,\ell_{-}\!\left(f_{2}^{\prime}\circ F_{1}^{\prime}(\mathbf{x}_{-}^{\prime})\right)+\lambda g(\|\mathbf{w}^{\prime}\|) \tag{80}$$
$$=\kappa\,\ell_{+}\!\left(\left\langle\tfrac{\mathbf{w}^{\prime}}{\|\mathbf{w}^{\prime}\|},\Phi_{+}^{\prime}\right\rangle\|\mathbf{w}^{\prime}\|+b^{\prime}\right)+(1-\kappa)\,\ell_{-}\!\left(\left\langle\tfrac{\mathbf{w}^{\prime}}{\|\mathbf{w}^{\prime}\|},\Phi_{-}^{\prime}\right\rangle\|\mathbf{w}^{\prime}\|+b^{\prime}\right)+\lambda g(\|\mathbf{w}^{\prime}\|) \tag{81}$$

for $\mathbf{x}_{+}^{\prime},\mathbf{x}_{-}^{\prime}$ with $\mathbf{x}_{+}^{\prime}$ maximizing $f_{2}^{\prime}\circ F_{1}^{\prime}(\mathbf{x}_{i})$ over $\mathbf{x}_{i}\in S_{+}$ and $\mathbf{x}_{-}^{\prime}$ minimizing $f_{2}^{\prime}\circ F_{1}^{\prime}(\mathbf{x}_{j})$ over $\mathbf{x}_{j}\in S_{-}$, where $\Phi_{+}^{\prime}:=\Phi\left(F_{1}^{\prime}(\mathbf{x}_{+}^{\prime})\right)$ and $\Phi_{-}^{\prime}:=\Phi\left(F_{1}^{\prime}(\mathbf{x}_{-}^{\prime})\right)$.

Using the assumption on $F_{1}^{\star}$,

$$\|\Phi_{+}^{\star}-\Phi_{-}^{\star}\|\geq\|\Phi_{+}^{\prime}-\Phi_{-}^{\prime}\|. \tag{82}$$

Then, by the earlier Lemma, there exists a unit vector $\mathbf{e}^{\star}$ such that

$$\left\langle\mathbf{e}^{\star},\Phi_{+}^{\star}\right\rangle\geq\left\langle\tfrac{\mathbf{w}^{\prime}}{\|\mathbf{w}^{\prime}\|},\Phi_{+}^{\prime}\right\rangle; \tag{83}$$
$$\left\langle\mathbf{e}^{\star},\Phi_{-}^{\star}\right\rangle\leq\left\langle\tfrac{\mathbf{w}^{\prime}}{\|\mathbf{w}^{\prime}\|},\Phi_{-}^{\prime}\right\rangle. \tag{84}$$

Let $A:=\{\mathbf{w}:\|\mathbf{w}\|=\|\mathbf{w}^{\prime}\|\}$; then evidently $\mathbf{e}^{\star}\|\mathbf{w}^{\prime}\|\in A$, and we have

$$L(f_{2}^{\prime}\circ F_{1}^{\prime},S)\geq\kappa\,\ell_{+}\!\left(\left\langle\mathbf{e}^{\star},\Phi_{+}^{\star}\right\rangle\|\mathbf{w}^{\prime}\|+b^{\prime}\right)+(1-\kappa)\,\ell_{-}\!\left(\left\langle\mathbf{e}^{\star},\Phi_{-}^{\star}\right\rangle\|\mathbf{w}^{\prime}\|+b^{\prime}\right)+\lambda g\left(\|\mathbf{w}^{\prime}\|\right) \tag{86}$$
$$=\kappa\,\ell_{+}\!\left(\left\langle\tfrac{\mathbf{e}^{\star}\|\mathbf{w}^{\prime}\|}{\left\|\mathbf{e}^{\star}\|\mathbf{w}^{\prime}\|\right\|},\Phi_{+}^{\star}\right\rangle\left\|\mathbf{e}^{\star}\|\mathbf{w}^{\prime}\|\right\|+b^{\prime}\right)+(1-\kappa)\,\ell_{-}\!\left(\left\langle\tfrac{\mathbf{e}^{\star}\|\mathbf{w}^{\prime}\|}{\left\|\mathbf{e}^{\star}\|\mathbf{w}^{\prime}\|\right\|},\Phi_{-}^{\star}\right\rangle\left\|\mathbf{e}^{\star}\|\mathbf{w}^{\prime}\|\right\|+b^{\prime}\right)+\lambda g\left(\left\|\mathbf{e}^{\star}\|\mathbf{w}^{\prime}\|\right\|\right) \tag{87}$$
$$\geq\min_{\mathbf{w}\in A}\ \kappa\,\ell_{+}\!\left(\left\langle\tfrac{\mathbf{w}}{\|\mathbf{w}\|},\Phi_{+}^{\star}\right\rangle\|\mathbf{w}\|+b^{\prime}\right)+(1-\kappa)\,\ell_{-}\!\left(\left\langle\tfrac{\mathbf{w}}{\|\mathbf{w}\|},\Phi_{-}^{\star}\right\rangle\|\mathbf{w}\|+b^{\prime}\right)+\lambda g\left(\|\mathbf{w}\|\right) \tag{89}$$
$$\geq\min_{\mathbf{w}\in A,\,b}\ \kappa\,\ell_{+}\!\left(\left\langle\tfrac{\mathbf{w}}{\|\mathbf{w}\|},\Phi_{+}^{\star}\right\rangle\|\mathbf{w}\|+b\right)+(1-\kappa)\,\ell_{-}\!\left(\left\langle\tfrac{\mathbf{w}}{\|\mathbf{w}\|},\Phi_{-}^{\star}\right\rangle\|\mathbf{w}\|+b\right)+\lambda g\left(\|\mathbf{w}\|\right) \tag{90}$$
$$\geq\min_{\mathbf{w}\in A,\,b}L(f_{2}\circ F_{1}^{\star},S) \tag{91}$$
$$\geq\min_{\mathbf{w},\,b}L(f_{2}\circ F_{1}^{\star},S) \tag{92}$$
$$=L(f_{2}^{\star}\circ F_{1}^{\star},S). \tag{93}$$

This proves the result. ∎


2. Example Loss Functions Satisfying Conditions of Theorem IV.1

The empirical risk of many popular loss functions can be decomposed into two terms, one nondecreasing and the other nonincreasing, such that the condition required by Theorem IV.1 is satisfied.

  • softmax + cross-entropy (two-class version):

$$\ell(F,S)=\frac{1}{n}\sum_{i=1}^{n}-\mathbbm{1}_{\{i\in I_{+}\}}\ln\left(\sigma\left(F(\mathbf{x}_{i})\right)\right)-\mathbbm{1}_{\{i\in I_{-}\}}\ln\left(1-\sigma\left(F(\mathbf{x}_{i})\right)\right) \tag{94}$$
$$=\frac{1}{n}\sum_{i\in I_{+}}\ln\left(e^{-F(\mathbf{x}_{i})}+1\right)+\frac{1}{n}\sum_{j\in I_{-}}\ln\left(e^{F(\mathbf{x}_{j})}+1\right), \tag{95}$$

    where $\sigma$ is the softmax nonlinearity.

  • tanh + mean squared error (this decomposition works for any monotonic nonlinearity with the value of $y_{i}$ adjusted for the range of the nonlinearity):

$$\ell(F,S)=\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-\delta\left(F(\mathbf{x}_{i})\right)\right)^{2} \tag{96}$$
$$=\frac{1}{n}\sum_{i\in I_{+}}\left(1-\delta\left(F(\mathbf{x}_{i})\right)\right)^{2}+\frac{1}{n}\sum_{j\in I_{-}}\left(1+\delta\left(F(\mathbf{x}_{j})\right)\right)^{2}, \tag{97}$$

    where $\delta$ is the hyperbolic tangent nonlinearity.

  • hinge loss:

$$\ell(F,S)=\frac{1}{n}\sum_{i=1}^{n}\max(0,1-y_{i}F(\mathbf{x}_{i})) \tag{98}$$
$$=\frac{1}{n}\sum_{i\in I_{+}}\max(0,1-F(\mathbf{x}_{i}))+\frac{1}{n}\sum_{j\in I_{-}}\max(0,1+F(\mathbf{x}_{j})). \tag{99}$$

3. Additional Transferability Results

[Eight figure panels, (a)-(h)]
Figure 8: Additional figures with other tasks as the source/target task, supplementing Fig. 7.
[Six figure panels, (a)-(f)]
Figure 9: Additional figures with other tasks as the source/target task, supplementing Fig. 7.