Meta-Ensemble Parameter Learning
Abstract
Ensembles of machine learning models yield improved performance as well as robustness. However, their memory requirements and inference costs can be prohibitively high. Knowledge distillation is an approach that allows a single model to efficiently capture the approximate performance of an ensemble, yet it scales poorly because it demands re-training whenever new teacher models are introduced. In this paper, we study whether a meta-learning strategy can directly predict the parameters of a single model with performance comparable to that of an ensemble. To this end, we introduce WeightFormer, a Transformer-based model that predicts student network weights layer by layer in a single forward pass, conditioned on the teacher model parameters. The properties of WeightFormer are investigated on the CIFAR-10, CIFAR-100, and ImageNet datasets with the VGGNet-11, ResNet-50, and ViT-B/32 architectures, where our method approaches the classification performance of an ensemble and outperforms both the single network and standard knowledge distillation. More encouragingly, we show that WeightFormer can even exceed the average ensemble with minor fine-tuning. Importantly, our task, model, and results can potentially lead to a new, more efficient, and scalable paradigm of ensemble network parameter learning.
1 Introduction
As machine learning models are being deployed ever more widely in practice, memory cost and inference efficiency become increasingly important (Bucilua et al., 2006; Polino et al., 2018). Ensemble methods, which train several independent models to form a decision, are well known to yield both improved performance and reliable estimations (Perrone & Cooper, 1992; Drucker et al., 1994; Opitz & Maclin, 1999; Dietterich, 2000; Sagi & Rokach, 2018). Despite these useful properties, using ensembles can be computationally prohibitive. Obtaining predictions in real-time applications is often expensive even for a single model, and the hardware requirements for serving an ensemble scale linearly with the number of member models (Buizza & Palmer, 1998; Bonab & Can, 2019). As a result, over the past several years the area of knowledge distillation has gained increasing attention (Hinton et al., 2015; Freitag et al., 2017; Malinin et al., 2019; Lin et al., 2020; Park et al., 2021; Zhao et al., 2022). Broadly speaking, distillation methods aim to train a single student model that approximates the behavior of a teacher ensemble at a low computational cost. In the simplest and most frequently used form of distillation (Hinton et al., 2015), the student model is trained to capture the average prediction of the ensemble; e.g., in image classification this reduces to minimizing the KL divergence between the soft labels of the student model and the teacher models.

When optimizing the parameters for a new ensemble model, the typical knowledge distillation process disregards information about the teacher model parameters and past experience from distilling previous teacher models. However, leveraging this training information can be key to reducing the high computational demands. To progress in this direction, we propose a new task, referred to as Meta-Ensemble Parameter Learning, where the parameters of the distillation student model are directly predicted by a weight prediction network. The main idea is to use deep learning models to learn the parameter distillation process and generate an entire student model by producing all of its weights in a single pass. This can reduce the overall computation cost in cases where the tasks or ensemble models are updated frequently. To our knowledge, meta-ensemble parameter learning has not been previously investigated. Figure 1 depicts the different information transfer paradigms, including model ensemble, knowledge distillation, and meta-ensemble parameter learning; the dotted lines represent the training flow.
To address this task, we introduce WeightFormer, a model that directly predicts the distilled student model parameters. Our architecture takes inspiration from the Transformer (Vaswani et al., 2017) and incorporates two key novelties to imitate the characteristics of model ensembling, i.e., cross-layer information flow and a shift consistency constraint. We evaluate the classification performance obtained with the predicted parameters on the conventional convolutional architectures VGGNet-11 and ResNet-50 and on the Transformer architecture ViT-B/32 (Dosovitskiy et al., 2020), using the CIFAR-10, CIFAR-100, and ImageNet datasets. Experimental results show that the models predicted by our method in one forward pass approach the performance of the average ensemble and significantly outperform regular knowledge distillation. Moreover, WeightFormer further exceeds the average ensemble with fine-tuning, which is expected to fit well with the variability and complexity of application scenarios.
Overall, our framework and results pave the road toward a new and significantly more efficient paradigm for ensemble model parameter learning. The contributions of this paper are summarized as follows: 1) We introduce a novel task of directly predicting the parameters of the distillation student model from multiple teacher network parameters, which encourages exploiting past ensemble training experience to improve performance and reduce computation demand. 2) We design WeightFormer, a simple and effective benchmark with cross-layer information flow and a shift consistency constraint, to track progress on the model weight generation task. Experimentally, our approach performs surprisingly well and robustly across different model architectures and datasets. 3) We show that WeightFormer can be transferred to weight generation for unseen teacher models in a single forward pass, and that more competitive results can be obtained with additional fine-tuning data. Moreover, to improve reproducibility and foster new research in this field, we will publicly release the source code and trained models.
2 Task Formulation
Here we consider the problem of distilling a neural network from several trained neural networks, also known as the teacher-student paradigm (Hinton et al., 2015), on the image classification task (Rokach, 2010). It essentially aims to train a single student model that captures the mean decision of an ensemble, allowing close-to-ensemble performance at a far lower computation cost. This problem can be formalized as finding optimal parameters $\theta_S$ for a target student network $f_S$, given a set of $K$ teacher networks $\{f_{T_k}\}_{k=1}^{K}$ parameterized by $\{\theta_{T_k}\}_{k=1}^{K}$, w.r.t. a loss function on the dataset $\mathcal{D}$ of input images $x$ and ground-truth labels $y$:
$$\theta_S^{*} = \arg\min_{\theta_S} \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \Big[ \mathrm{KL}\Big( \tfrac{1}{K}\textstyle\sum_{k=1}^{K} p\big(y \mid x;\, \theta_{T_k}\big) \;\Big\|\; p\big(y \mid x;\, \theta_S\big) \Big) \Big], \tag{1}$$
where the optimization objective includes the Kullback-Leibler divergence, denoted KL$(\cdot\|\cdot)$, between the mean soft labels of the teacher models and the predictions of the student model. Here $p(y \mid x;\, \theta_{T_k})$ is the output distribution of the $k$-th teacher network and $\theta_S^{*}$ denotes the resulting parameters of the ensemble distillation model. We assume all teacher models share the same network architecture and leave ensemble learning of heterogeneous models as future work. Despite the memory savings offered by the distilled ensemble model, obtaining $\theta_S^{*}$ remains a bottleneck in large-scale machine learning pipelines. In particular, with the growing size of networks, the classical process of obtaining the ensemble parameters, retraining from scratch, is becoming computationally unsustainable.
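For concreteness, a minimal PyTorch-style sketch of the objective in Eq. (1) is given below; the function name and the optional temperature argument (following Hinton et al., 2015) are illustrative assumptions rather than a released implementation.

```python
import torch
import torch.nn.functional as F

def ensemble_distillation_loss(student_logits, teacher_logits_list, temperature=1.0):
    """KL divergence between the averaged teacher soft labels and the student
    prediction, as in Eq. (1)."""
    # Mean of the teachers' softened class distributions.
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL(mean teacher || student), averaged over the batch.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```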
In this paper, we highlight that the knowledge gained in preceding ensemble training is also important for parameter optimization and propose a new task, named Meta-Ensemble Parameter Learning, where the parameters of the distilled ensemble model are directly predicted by a deep network. Formally, the task aims to generate the parameters $\theta_S$ of the target model in a single forward pass using a weight generation network $g$, parameterized by $\phi$:
$$\theta_S = g\big(\theta_{T_1}, \theta_{T_2}, \dots, \theta_{T_K};\, \phi\big). \tag{2}$$
The task is defined with respect to a dataset $\mathcal{D}$: $\theta_S$ is the predicted parameter set for which the test-set performance of $f_S$ approximates the performance of the model ensemble, while maintaining training efficiency and scalability. In this manner, we can even distill unseen teacher models and achieve competitive performance without any training cost.
3 Methodology

In this section, we describe our approach, dubbed WeightFormer, which serves as an effective solution for meta-ensemble parameter learning based on the Transformer structure. For simplicity, we describe the prediction of CNN models containing a set of convolutional layers and two fully-connected logits layers, as well as the self-attention layers of a Transformer. Note that most common parametric layers can be predicted by WeightFormer, as shown in the experiments.
3.1 Representation of Model Parameters
For the weight matrices in different layers of the teacher models, given a convolutional kernel size $k$ and input / output channel numbers $c_{in}$ / $c_{out}$, we encode each convolutional kernel as $c_{out}$ tokens with weight slices of dimensionality $c_{in} \cdot k^2$, and the fully-connected logits layer weights as tokens with dimensionality equal to the input feature size (Zhmoginov et al., 2022). For the parameters of a self-attention layer, which include the query, key, and value projections $W_Q$, $W_K$, $W_V$ and one output concatenation matrix $W_O$, the weight sequence is modeled with length $4d$ and token dimension $d$, where $d$ is the attention feature dimension. The concatenated weight sequence of the different teacher models is then fed, layer by layer, into a weight embedding dictionary to match the hidden dimension $d_{model}$ of the WeightFormer. The weight embedding dictionary consists of a cluster of dimension-adjustment embedding functions for the different network layers, where $d_w$ denotes the dimension of the original weight token. Specifically, it maps the input weight dimension $d_w$ to $d_{model}$ before the WeightFormer and maps the output hidden states back from $d_{model}$ to $d_w$ afterwards. In this way, we can easily extend WeightFormer to layers of varying shapes with a shared weight generator in a unified way.
To enable the WeightFormer model to distinguish among the different parts of the input weights and to make use of the sequence order, the final representation of each weight token is obtained by summing its weight feature embedding, position embedding, and model id embedding, as illustrated in Figure 2. Note that we use relative positions to distinguish the parameters along the output dimension within one model.
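The tokenization described above can be sketched as follows; the shapes follow this subsection, while class and function names (e.g., `WeightEmbeddingDictionary`) are illustrative assumptions rather than the released code.

```python
import torch
import torch.nn as nn

def conv_weights_to_tokens(weight):
    """A (c_out, c_in, k, k) convolutional kernel becomes c_out tokens,
    each a flattened weight slice of dimensionality c_in * k * k."""
    c_out = weight.shape[0]
    return weight.reshape(c_out, -1)

def attention_weights_to_tokens(w_q, w_k, w_v, w_o):
    """The four d x d projection matrices of one self-attention layer are
    stacked into a token sequence of length 4d with token dimension d."""
    return torch.cat([w_q, w_k, w_v, w_o], dim=0)

class WeightEmbeddingDictionary(nn.Module):
    """Layer-specific linear maps d_w -> d_model (before WeightFormer) and
    d_model -> d_w (after it), so one shared generator handles all layers."""
    def __init__(self, layer_dims, d_model):
        super().__init__()
        self.encode_fns = nn.ModuleDict(
            {name: nn.Linear(d_w, d_model) for name, d_w in layer_dims.items()})
        self.decode_fns = nn.ModuleDict(
            {name: nn.Linear(d_model, d_w) for name, d_w in layer_dims.items()})

    def encode(self, layer_name, tokens):   # tokens: (seq_len, d_w)
        return self.encode_fns[layer_name](tokens)

    def decode(self, layer_name, hidden):   # hidden: (seq_len, d_model)
        return self.decode_fns[layer_name](hidden)
```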
3.2 Design of Architecture
As introduced in the preceding discussion, the parameter predictor $g$ is the core of the parameter learning task, and here we choose $g$ to be a stack of Transformer blocks that takes the parameters of the teacher models as input and produces weights for some or all layers of the target ensemble model. The weights are generated layer by layer, starting from the first layer. Next, we introduce the key novelties in detail.
Cross-Layer Information Flow.
When WeightFormer predicts the parameters of a specific layer, cross-layer information should be incorporated as prior context for holistic parameter prediction, since the current layer is not independent of the previous layers. To achieve this, we extend the original weight matrix sequence with an extra vector [cross], which is inserted at the first position and fully attends to all other weight tokens from a global perspective. Its output hidden state, which contains the layer information and serves as the cross-layer feature, is fed into the parameter prediction of the next generated layer. For the first layer, the value of the [cross] token is a randomly initialized learnable vector. In particular, we argue that this output hidden state implicitly models an embedding of the current layer index and is critical for layer-specific generation.
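A minimal sketch of one layer-generation step with the [cross] token is shown below; it assumes a batch-first Transformer encoder and that the student's weight tokens are read from the positions immediately following [cross], as described above.

```python
import torch

def predict_layer(weightformer_encoder, teacher_tokens, cross_state, n_student_tokens):
    """Prepend the [cross] token carrying context from previously generated
    layers, run full self-attention over all weight tokens, then split the
    output into (updated cross state, predicted student weight tokens)."""
    # teacher_tokens: (seq_len, d_model); cross_state: (1, d_model)
    inputs = torch.cat([cross_state, teacher_tokens], dim=0).unsqueeze(0)
    hidden = weightformer_encoder(inputs).squeeze(0)
    new_cross_state = hidden[:1]                     # fed to the next layer's step
    student_tokens = hidden[1:1 + n_student_tokens]  # tokens following [cross]
    return new_cross_state, student_tokens
```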
Shift Consistency.
Shift consistency is motivated by the fact that, for the same teacher models presented in different input orders, the output of ensemble parameter prediction should be identical. To this end, we develop a shift consistency loss that regularizes the parameter prediction under shifted input sequences. As shown in the right part of Figure 2, we introduce a model-weight-shifted topology. Concretely, given the input weight sequence at each training step, we pass the same input through the WeightFormer network twice, once with the original model weight order and once with the order shifted by one model. For these two passes we apply different weight cutoff patterns (Budczies et al., 2012), where the dropped values are chosen along the feature dimension of the weight tokens. We thus obtain two parameter predictions, $\hat{\theta}^{(1)}$ and $\hat{\theta}^{(2)}$. To bring these two predicted weights close to each other, we minimize the mean squared error between them:
$$\mathcal{L}_{sc} = \big\| \hat{\theta}^{(1)} - \hat{\theta}^{(2)} \big\|_2^2. \tag{3}$$
Intuitively, this regularization forces the outputs of WeightFormer to be invariant to the input order of the teacher models. After the entire input weight sequence is processed by WeightFormer, we extract the output tokens immediately following [cross] and assemble the final student model by adjusting the weights' dimensions with the layer-specific embedding dictionary.
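A sketch of the shift consistency term of Eq. (3) follows; `generate_weights` stands for one full forward pass of WeightFormer (with weight cutoff active, so the two passes see different cutoff patterns) and is an assumed helper, not a released API.

```python
import torch.nn.functional as F

def shift_consistency_loss(generate_weights, teacher_weight_seqs):
    """Run the generator on the original teacher order and on a one-model
    shifted order, and penalise the squared difference of the predictions."""
    shifted = teacher_weight_seqs[1:] + teacher_weight_seqs[:1]  # rotate by one model
    pred_original = generate_weights(teacher_weight_seqs)  # cutoff pattern A
    pred_shifted = generate_weights(shifted)                # cutoff pattern B
    return F.mse_loss(pred_original, pred_shifted)
```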
3.3 Training Procedure
Instead of directly learning the student model parameters, we follow the bi-level optimization paradigm common in meta-learning (Hospedales et al., 2020; Liu et al., 2021a) to train the WeightFormer model:
$$\min_{\phi} \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \Big[ \mathcal{L}_{ce}\big(f\big(x;\, g(\theta_{T_1},\dots,\theta_{T_K};\,\phi)\big),\, y\big) \Big] + \lambda\, \mathcal{L}_{sc}, \tag{4}$$
where $\mathcal{L}_{ce}$ is the single-label classification cross-entropy loss and $\lambda$ denotes the balancing factor for the consistency loss. By optimizing this joint loss, WeightFormer gradually learns how to capture the intrinsic knowledge of the network parameters from the teacher models.
During training, the WeightFormer model takes the combination of the teacher models' parameters, layer by layer, and produces the weights of all layers of the student model. Then, the loss in Equation 4 is computed on training-set samples passed through the generated student model. The weight generation parameters of WeightFormer are learned by optimizing this loss function with stochastic gradient descent. To avoid excessive cost from reloading parameters at each step, we re-sample the teacher model checkpoints from the model set only when a break condition is met, e.g., reaching a predefined number of steps or converging on the validation set. The overall procedure is presented in Algorithm 1. Encouragingly, WeightFormer generalizes well and predicts good parameters for different architectures, even for unseen teacher models.
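The training procedure (Algorithm 1) can be summarized by the following sketch; helpers such as `model_pool.sample` and `build_student` (which plugs the predicted weights into the student architecture) are assumed for illustration only.

```python
import torch
import torch.nn.functional as F

def train_weightformer(weightformer, model_pool, train_loader, build_student,
                       num_steps, resample_every=5000, lam=1.0, lr=3e-5):
    """Bi-level training: sample teacher checkpoints, predict the student
    weights in one forward pass, and back-propagate the classification and
    consistency losses into the generator parameters (Eq. 4)."""
    optimizer = torch.optim.SGD(weightformer.parameters(), lr=lr)
    teachers = model_pool.sample(k=3)
    for step, (images, labels) in enumerate(train_loader):
        if step > 0 and step % resample_every == 0:
            teachers = model_pool.sample(k=3)          # periodic checkpoint reloading
        student_weights, consistency = weightformer(teachers)
        student = build_student(student_weights)       # differentiable w.r.t. the generator
        loss = F.cross_entropy(student(images), labels) + lam * consistency
        optimizer.zero_grad()
        loss.backward()                                # gradients flow through predicted weights
        optimizer.step()
        if step >= num_steps:
            break
```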
4 Experiments
This section evaluates WeightFormer on the ensemble parameter learning task. We describe the experimental setup in Sec. 4.1 and compare our main results against various baselines in Sec. 4.2, where we also discuss beneficial side effects of predicting ensemble weights and the implications of our findings. Finally, we analyze the design of WeightFormer in Sec. 4.3 in an empirical study.
4.1 Experimental Settings
Datasets and Metrics.
We utilize three common image classification datasets: CIFAR-10 (C-10), CIFAR-100 (C-100) (Krizhevsky et al., 2009), and ImageNet (Deng et al., 2009). To evaluate image classification performance, we adopt two metrics: 1) top-$k$ accuracy (ACC-$k$), which checks whether any of the model's $k$ highest-probability predictions matches the expected result; 2) expected calibration error (ECE) (Guo et al., 2017), which takes a weighted average of the absolute difference between accuracy and confidence.
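For reference, a standard equal-width-bin implementation of ECE is sketched below; the 15-bin setting is an illustrative default, not a value specified in this paper.

```python
import torch

def expected_calibration_error(confidences, predictions, labels, n_bins=15):
    """Weighted average of |accuracy - confidence| over confidence bins
    (Guo et al., 2017)."""
    bin_edges = torch.linspace(0.0, 1.0, n_bins + 1)
    correct = predictions.eq(labels).float()
    ece = torch.zeros(1)
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        prop_in_bin = in_bin.float().mean()
        if prop_in_bin > 0:
            bin_acc = correct[in_bin].mean()
            bin_conf = confidences[in_bin].mean()
            ece += prop_in_bin * (bin_acc - bin_conf).abs()
    return ece.item()
```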
Compared Baselines.
The following baselines for knowledge induction are considered: 1) Single refers to the average performance across different individual, independently evaluated models. 2) Ensemble refers to the performance of a deep ensemble that averages the logits of the member models, which is currently considered both the best and the simplest ensemble method. 3) KD refers to ensemble distillation via minimizing the KL divergence between the student and the ensemble's mean prediction for image classification. In practice, KD is executed with a fixed temperature and the same structure as the teacher network, following (Hinton et al., 2015). 4) MLP denotes a simple MLP that only has access to the weight parameters of the teacher models to predict the target ensemble parameters, but not to the connections between different layers. Note that predicting the parameters of each layer requires an independent MLP network.
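As a point of reference, the Ensemble baseline reduces to averaging member logits; a minimal sketch (function name is illustrative):

```python
import torch

@torch.no_grad()
def deep_ensemble_predict(models, images):
    """Average the logits of all member models; the argmax of the averaged
    logits is the ensemble prediction."""
    avg_logits = torch.stack([m(images) for m in models], dim=0).mean(dim=0)
    return avg_logits.argmax(dim=-1)
```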
Implementation Details.
WeightFormer can in principle generate arbitrarily large weight tensors by concatenating all the parameters of the teacher models as input and then predicting the student model weight tensors layer by layer. For the model architecture, WeightFormer closely follows the network architecture and hyper-parameter settings of an encoder-only Transformer (Vaswani et al., 2017). Specifically, the number of stacked encoder blocks is set to 24, the hidden size to 1280, the number of attention heads to 20, and the feed-forward network size to 4096. For the training process, to avoid collapse, we first pre-train WeightFormer with supervision from knowledge distillation model parameters under an L2 matching loss, using SGD with a learning rate of 1e-3. We then sample checkpoints from the model set and data pairs from the training set and train WeightFormer with the combined loss, using an initial learning rate of 3e-5 that decays by 0.9 every five epochs. To reduce the cost of parameter loading at each step, we reload or re-sample model checkpoints every 5k steps. Training is terminated once accuracy no longer improves on the validation set. The balancing factor $\lambda$ is set to 1.0 based on preliminary experiments.
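For convenience, the hyper-parameters reported above are collected below; the variable name and grouping are ours, not from a released configuration file.

```python
# Hyper-parameters reported in Sec. 4.1 (names and grouping are illustrative).
weightformer_config = dict(
    num_encoder_blocks=24,
    hidden_size=1280,
    num_attention_heads=20,
    ffn_size=4096,
    pretrain_lr=1e-3,            # SGD, L2 matching against KD model parameters
    train_lr=3e-5,               # combined loss, decayed by 0.9 every five epochs
    lr_decay=0.9,
    lr_decay_every_epochs=5,
    checkpoint_resample_steps=5000,
    consistency_weight=1.0,      # balancing factor lambda in Eq. (4)
)
```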
4.2 Comparison with Knowledge Induction Methods
We first turn our attention to ensemble model parameter prediction with different knowledge induction methods. This section presents our main results, which show that the proposed WeightFormer can serve as a competitive alternative to the conventional ensemble procedure while requiring no additional training or memory cost.
Experimental Setup.
To evaluate the performance of ensemble parameter learning, a model set of 72 checkpoints is constructed by independently training with different random seed initializations and hyperparameters, for VGGNet-11, ResNet-50, and ViT-B/32 on the C-10, C-100, and ImageNet datasets. In each setting, 60 models are used for training and 12 for evaluation. Checkpoints of ViT-B/32 are loaded from the repository https://github.com/mlfoundations/model-soups; please refer to (Wortsman et al., 2022) for the detailed hyperparameter settings. Since the downloaded ViT-B/32 models are fine-tuned on ImageNet, we only conduct experiments on the ImageNet dataset for this architecture. We consider the scenario of a 3-model ensemble, i.e., 3 checkpoints are randomly sampled from the training and evaluation model pools, respectively. The evaluation checkpoints are identical across methods for fair comparison. In addition to the four baselines introduced in Section 4.1, we also report WF*, which denotes fine-tuning WeightFormer with sampled unseen checkpoints on the corresponding training dataset until convergence on the validation set.
Table 1: Results with VGGNet-11 teacher models (± denotes standard deviation; NA: not reported for the Ensemble).

| Dataset | Metric | Single | Ensemble | KD | MLP | WF | WF* |
|---|---|---|---|---|---|---|---|
| C-10 | ACC-1 | 92.4 ± 0.6 | 93.8 (NA) | 92.8 ± 0.4 | 92.5 ± 0.6 | 93.3 ± 0.3 | 93.9 ± 0.2 |
| C-10 | ACC-5 | 98.8 ± 0.2 | 99.8 (NA) | 99.1 ± 0.1 | 98.8 ± 0.4 | 99.3 ± 0.1 | 99.8 ± 0.1 |
| C-10 | ECE | 2.5 ± 0.6 | 1.1 (NA) | 2.3 ± 0.5 | 2.5 ± 0.5 | 1.4 ± 0.3 | 1.0 ± 0.2 |
| C-100 | ACC-1 | 69.8 ± 0.5 | 72.6 (NA) | 71.3 ± 0.3 | 70.0 ± 0.6 | 72.0 ± 0.3 | 72.9 ± 0.3 |
| C-100 | ACC-5 | 90.5 ± 0.3 | 92.4 (NA) | 91.0 ± 0.1 | 90.7 ± 0.2 | 91.6 ± 0.1 | 92.5 ± 0.2 |
| C-100 | ECE | 10.1 ± 0.5 | 3.2 (NA) | 7.7 ± 0.4 | 9.8 ± 0.5 | 5.4 ± 0.3 | 3.0 ± 0.3 |
| ImageNet | ACC-1 | 70.3 ± 0.5 | 73.4 (NA) | 71.1 ± 0.2 | 70.5 ± 0.8 | 72.2 ± 0.5 | 73.6 ± 0.4 |
| ImageNet | ACC-5 | 89.8 ± 0.4 | 91.4 (NA) | 90.3 ± 0.2 | 89.9 ± 0.2 | 90.9 ± 0.1 | 91.3 ± 0.2 |
| ImageNet | ECE | 19.6 ± 0.8 | 5.5 (NA) | 9.2 ± 0.3 | 18.5 ± 0.7 | 6.3 ± 0.5 | 5.2 ± 0.3 |
Table 2: Results with ResNet-50 teacher models.

| Dataset | Metric | Single | Ensemble | KD | MLP | WF | WF* |
|---|---|---|---|---|---|---|---|
| C-10 | ACC-1 | 93.6 ± 0.7 | 94.7 (NA) | 93.9 ± 0.4 | 93.5 ± 0.5 | 94.3 ± 0.6 | 94.8 ± 0.3 |
| C-10 | ACC-5 | 99.7 ± 0.1 | 99.8 (NA) | 99.7 ± 0.1 | 99.7 ± 0.1 | 99.8 ± 0.1 | 99.8 ± 0.05 |
| C-10 | ECE | 2.1 ± 0.5 | 0.9 (NA) | 1.8 ± 0.5 | 2.3 ± 0.5 | 1.6 ± 0.4 | 1.0 ± 0.2 |
| C-100 | ACC-1 | 77.4 ± 0.8 | 79.7 (NA) | 78.2 ± 0.4 | 77.6 ± 0.4 | 78.9 ± 0.4 | 79.9 ± 0.4 |
| C-100 | ACC-5 | 93.9 ± 0.2 | 94.8 (NA) | 94.1 ± 0.2 | 94.0 ± 0.1 | 94.5 ± 0.1 | 94.9 ± 0.1 |
| C-100 | ECE | 10.4 ± 0.6 | 3.0 (NA) | 6.3 ± 0.6 | 9.3 ± 0.5 | 4.3 ± 0.4 | 2.7 ± 0.2 |
| ImageNet | ACC-1 | 76.1 ± 0.6 | 78.3 (NA) | 77.1 ± 0.3 | 76.2 ± 0.3 | 77.9 ± 0.3 | 78.7 ± 0.5 |
| ImageNet | ACC-5 | 92.8 ± 0.3 | 94.1 (NA) | 93.2 ± 0.2 | 92.8 ± 0.1 | 93.8 ± 0.1 | 94.3 ± 0.2 |
| ImageNet | ECE | 16.5 ± 1.0 | 4.6 (NA) | 10.3 ± 0.8 | 16.8 ± 0.6 | 6.5 ± 0.5 | 4.5 ± 0.2 |
Main Results.
Tables 1, 2, and 3 report the evaluation results of our ensemble parameter distillation experiments for the VGGNet-11, ResNet-50, and ViT-B/32 architectures on the C-10, C-100, and ImageNet datasets. We observe the following. 1) The ensemble approach significantly outperforms the single model; it is the canonical approach and is therefore generally adopted when the absolute best score is desired. 2) With our WeightFormer approach, we come rather close to the performance of the ensemble even for unseen model checkpoints; in contrast, to fit different teacher models, KD would require additional training cost and is limited in scalability. Given that our goal is to obtain the performance of an ensemble in one forward pass without retraining, we consider this result highly encouraging. 3) WeightFormer outperforms the single model baseline by an average of 1.7% on the ACC-1 metric on ImageNet, with similar trends on the other two datasets. Meanwhile, for ECE, which compares the network's output pseudo-probabilities to its accuracy, the improvements are even more evident. 4) WeightFormer exhibits lower variance and more robust prediction performance than KD, further illustrating the superiority of the proposed approach. 5) The simple MLP baseline shows much poorer transfer capacity for unseen checkpoints, whereas the Transformer with our designed improvements is far more promising. 6) More encouragingly, when WeightFormer is fine-tuned with training data and teacher model parameters, the predicted model surpasses the average ensemble on some metrics, e.g., a 0.3% ACC-1 improvement for the ViT-B/32 architecture on ImageNet, demonstrating the promise of the meta-ensemble paradigm.
Table 3: Results with ViT-B/32 teacher models on ImageNet.

| Dataset | Metric | Single | Ensemble | KD | MLP | WF | WF* |
|---|---|---|---|---|---|---|---|
| ImageNet | ACC-1 | 78.3 ± 1.2 | 80.2 (NA) | 79.0 ± 0.8 | 78.4 ± 0.5 | 79.8 ± 0.4 | 80.5 ± 0.3 |
| ImageNet | ACC-5 | 93.5 ± 0.3 | 94.9 (NA) | 94.0 ± 0.2 | 93.9 ± 0.1 | 94.8 ± 0.2 | 95.1 ± 0.2 |
| ImageNet | ECE | 2.6 ± 0.4 | 1.9 (NA) | 2.3 ± 0.3 | 2.6 ± 0.3 | 2.5 ± 0.4 | 1.7 ± 0.3 |
4.3 Model Analysis
In this section, we conduct analysis from different perspectives to better understand the advantages and properties of the proposed WeightFormer, using ensembles of VGGNet-11 / ViT-B teacher models on the C-10, C-100, and ImageNet image classification datasets.
Impact of Different Model Components.
We investigate the impact of the different components of our proposed WeightFormer framework, including cross-layer fusion and shift consistency (i.e., whether the factor $\lambda$ in Equation 4 is set to zero), on the ensemble parameter learning task. To separate the contribution of the consistency loss from that of the weight cutoff, we also test a variant without cutoff augmentation. Table 4 reports the evaluation metrics for the different variants. We can see that: 1) each component contributes positively to the final classification performance, with shift consistency yielding the largest improvement in weight generation; and 2) for different teacher model architectures, e.g., CNN or Transformer, the components bring different benefits. In particular, the more complex the teacher model structure, the larger the gain of the predicted ensemble. More advanced structures, e.g., Swin Transformer (Liu et al., 2021b), are worth exploring in the future.
Table 4: Ablation of WeightFormer components (VGGNet-11 teachers for C-10 / C-100, ViT-B teachers for ImageNet).

| Model variant | C-10 ACC-1 | C-10 ACC-5 | C-100 ACC-1 | C-100 ACC-5 | ImageNet ACC-1 | ImageNet ACC-5 |
|---|---|---|---|---|---|---|
| WeightFormer | 93.3 ± 0.3 | 99.3 ± 0.1 | 72.0 ± 0.2 | 91.6 ± 0.1 | 80.5 ± 0.3 | 95.1 ± 0.2 |
| - Cross-layer Fusion | 93.1 ± 0.3 | 99.3 ± 0.2 | 71.8 ± 0.3 | 91.3 ± 0.2 | 80.2 ± 0.4 | 95.0 ± 0.2 |
| - Shift Consistency | 93.0 ± 0.4 | 99.2 ± 0.06 | 71.4 ± 0.5 | 91.1 ± 0.1 | 79.9 ± 0.5 | 94.8 ± 0.2 |
| - Weight Cutoff | 93.0 ± 0.3 | 99.2 ± 0.1 | 71.6 ± 0.4 | 91.2 ± 0.1 | 80.1 ± 0.3 | 94.8 ± 0.1 |
Effect of Incorporated Ensemble Model Number.
It is well known that increasing the number of teacher models improves the final performance of an ensemble on the underlying task. It is therefore natural to ask whether our parameter prediction can also benefit from more teacher models. Hence, we conduct an ablation over the number of teachers using the zero-shot VGGNet-11 model checkpoints, with two configurations of our method: 1) Heuristic: select two teacher models without replacement each time for ensemble parameter prediction, and put the newly generated model back into the teacher model set (see the sketch below); and 2) Concatenate: directly concatenate all teacher model parameters to predict the target weights. The results are summarized in Figure 3. For each teacher model number and configuration, we directly predict the network weights with the trained WeightFormer and report the metric scores on the test sets. Note that, owing to memory constraints, the Concatenate configuration only reports results for part of the model-number range. As can be seen, increasing the number of teacher models results in significant improvements in test evaluation. More importantly, the Transformer architecture allows increasing the number of incorporated teacher models with only a marginal increase in the number of learnable parameters.
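The Heuristic configuration can be sketched as the following pairwise-merging loop; the helper names and the callable interface of the generator are illustrative assumptions.

```python
import random

def heuristic_merge(weightformer, teacher_models, seed=0):
    """Repeatedly draw two models without replacement, predict a merged model
    from the pair, and return it to the pool until one model remains."""
    rng = random.Random(seed)
    pool = list(teacher_models)
    while len(pool) > 1:
        a, b = rng.sample(pool, 2)
        pool.remove(a)
        pool.remove(b)
        merged = weightformer([a, b])   # two-teacher parameter prediction
        pool.append(merged)
    return pool[0]
```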

Parameter Efficiency for WeightFormer.
Generally, our proposed approach learns a layer-specific weight embedding per layer, consisting of on the order of $d_w \cdot d_{model}$ parameters, where $d_w$ denotes the dimensionality of the input weight token, i.e., $c_{in} \cdot k^2$ for a convolutional layer with kernel size $k$ and $c_{in}$ input channels, the input feature size for a fully-connected layer, and the model dimensionality $d$ for a Transformer block, and $d_{model}$ is the WeightFormer hidden size. We additionally employ model id and relative positional embeddings, which require $(M + L_{max}) \cdot d_{model}$ parameters, where $M$ is the number of teacher models and $L_{max}$ is the maximum sequence length. It is worth highlighting that the Transformer blocks of WeightFormer remain identical across different ensemble scenarios as well as across layers. In settings with many layers and many teacher models $M$, our method is therefore considerably more parameter-efficient than the MLP approach.
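As a rough worked example under the assumptions above (an encoding and a decoding projection per layer, bias terms ignored; the concrete layer sizes are illustrative):

```python
d_model = 1280                         # WeightFormer hidden size (Sec. 4.1)
k, c_in = 3, 64                        # an example 3x3 conv layer with 64 input channels
d_w = c_in * k * k                     # weight-token dimensionality = 576
per_layer_embedding = 2 * d_w * d_model            # encode + decode maps: ~1.47M parameters
num_teachers, max_seq_len = 3, 4096                # illustrative values
extra_embeddings = (num_teachers + max_seq_len) * d_model  # model-id + positional embeddings
print(per_layer_embedding, extra_embeddings)
```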
5 Related Works
Model Ensemble and Knowledge Distillation.
Combining the outputs of several independent models is a foundational technique for improving the accuracy and robustness of machine learning systems (Dietterich, 2000; Bauer & Kohavi, 1999; Breiman, 1996; Lakshminarayanan et al., 2017; Freund & Schapire, 1997; Ovadia et al., 2019). Gontijo-Lopes et al. (2021) conduct a large-scale study of ensembles, finding that higher divergence in training methodology leads to less correlated errors and better ensemble accuracy. Meanwhile, many works have explored building ensembles from models produced by hyperparameter searches (Mendoza et al., 2016; Saikia et al., 2020) and greedy selection strategies (Caruana et al., 2004; 2006; Lévesque et al., 2016; Wenzel et al., 2020). Importantly, ensembles require a separate inference pass through each model, which increases computational cost; when the number of models is large, this can be prohibitively expensive (Gou et al., 2021; Wang & Yoon, 2021). Accordingly, knowledge distillation was proposed to let a weak model (student) learn existing knowledge from one or more strong models (teachers) by using their output probability distributions as soft labels (Ba & Caruana, 2014; Hinton et al., 2015). Such a paradigm can greatly reduce the number of model parameters without reducing performance (Kim & Rush, 2016). Freitag et al. (2017) transferred the knowledge of an ensemble model to a single model. Numerous works aim to refine and propagate core knowledge from the teacher networks, e.g., via auto-encoders (Kim et al., 2018), data instance relations (Park et al., 2019; Tung & Mori, 2019; Peng et al., 2019; Liu et al., 2019), mutual information (Ahn et al., 2019), and contrastive learning (Tian et al., 2019; Chen et al., 2021). However, all of the above knowledge distillation methods focus on transferring network output features; whether knowledge can be linked through parameters between teacher and student remains an open problem, which motivates our work on ensemble parameter learning.
Network Weight Modulation and Generation.
The idea of using a neural network and task-specific information to directly generate or modulate model weights has been explored previously (Bertinetto et al., 2016; Ratzlaff & Fuxin, 2019; Mahabadi et al., 2021; Tay et al., 2020; Ye & Ren, 2021). Several few-shot learning methods employ this approach and use task-specific generation or modulation of the weights of the final classification model (Snell et al., 2017; Lian et al., 2019; Zhmoginov et al., 2022). For example, the matching network in LGM-Net (Li et al., 2019) is used to generate a few layers on top of a task-agnostic embedding, and LEO (Rusu et al., 2018) uses a similar weight generation method to produce initial model weights from the training dataset in a few-shot setting. These methods are tied to a particular architecture and need to be retrained from scratch if it changes. Knyazev et al. (2021) and Elsken et al. (2020) resort to neural architecture search with graph networks for more general parameter prediction. Most recently, Wortsman et al. (2022) proposed model soups, which average the weights of multiple models fine-tuned with different hyperparameter configurations to improve accuracy; this can be regarded as a special case of our parameter learning. Inspired by hypernetworks (Ha et al., 2016), we explore a similar idea, directly generating an entire student model from the parameters of teacher models, which simplifies and stabilizes training.
6 Conclusion
In this paper, we focus on ensemble parameter learning, which aims to directly generate the weights of a distilled ensemble model. Accordingly, we introduce WeightFormer, a new Transformer architecture that leverages the representation of teacher model parameters for distillation weight generation. The proposed approach is able to predict parameters for a distillation ensemble of diverse teacher models with cross-layer information flow modeling and a shift consistency constraint. The network with predicted parameters yields surprisingly high image classification accuracy given the challenging nature of our task. More importantly, we show that, unlike knowledge distillation, WeightFormer can be straightforwardly extended to handle unseen teacher models and can even exceed the average ensemble with small-scale fine-tuning. We believe our task, model, and results can potentially lead to a new paradigm of ensemble network learning.
Broader Impact and Limitations.
Currently, advanced knowledge distillation methods are used to avoid the heavy memory cost of model ensembles. However, their poor training extensibility restricts their application in more flexible scenarios. Our work, meta-ensemble parameter learning, takes a first step toward a general framework for direct parameter prediction from multiple teacher model parameters, with conspicuous promise. On the other hand, one limitation of this work is that it does not explore weight generation across different teacher architectures. We encourage fellow researchers to look into this, making weight generation more reliable across application scenarios and network architectures. Other than that, since this work mostly concerns the discovery of ensemble weight generation, we do not foresee any direct negative impacts on society.
References
- Ahn et al. (2019) Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil D Lawrence, and Zhenwen Dai. Variational information distillation for knowledge transfer. In Proc. IEEE CVPR, pp. 9163–9171, 2019.
- Ba & Caruana (2014) Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? Proc. NIPS, 27, 2014.
- Bauer & Kohavi (1999) Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine learning, 36(1):105–139, 1999.
- Bertinetto et al. (2016) Luca Bertinetto, João F Henriques, Jack Valmadre, Philip Torr, and Andrea Vedaldi. Learning feed-forward one-shot learners. Proc. NIPS, 29, 2016.
- Bonab & Can (2019) Hamed Bonab and Fazli Can. Less is more: A comprehensive framework for the number of components of ensemble classifiers. IEEE Transactions on neural networks and learning systems, 30(9):2735–2745, 2019.
- Breiman (1996) Leo Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996.
- Bucilua et al. (2006) Cristian Bucilua, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proc. ACM SIGKDD, pp. 535–541, 2006.
- Budczies et al. (2012) Jan Budczies, Frederick Klauschen, Bruno V Sinn, Balázs Győrffy, Wolfgang D Schmitt, Silvia Darb-Esfahani, and Carsten Denkert. Cutoff finder: a comprehensive and straightforward web application enabling rapid biomarker cutoff optimization. PloS one, 7(12):e51862, 2012.
- Buizza & Palmer (1998) Roberto Buizza and Tim N Palmer. Impact of ensemble size on ensemble prediction. Monthly Weather Review, 126(9):2503–2518, 1998.
- Caruana et al. (2004) Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew, and Alex Ksikes. Ensemble selection from libraries of models. In Proc. ICML, pp. 18, 2004.
- Caruana et al. (2006) Rich Caruana, Art Munson, and Alexandru Niculescu-Mizil. Getting the most out of ensemble selection. In Proc. ICDM, pp. 828–833. IEEE, 2006.
- Chen et al. (2021) Liqun Chen, Dong Wang, Zhe Gan, Jingjing Liu, Ricardo Henao, and Lawrence Carin. Wasserstein contrastive representation distillation. In Proc. IEEE CVPR, pp. 16296–16305, 2021.
- Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proc. IEEE CVPR, pp. 248–255. IEEE, 2009.
- Dietterich (2000) Thomas G Dietterich. Ensemble methods in machine learning. In International workshop on multiple classifier systems, pp. 1–15. Springer, 2000.
- Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proc. ICLR, 2020.
- Drucker et al. (1994) Harris Drucker, Corinna Cortes, Lawrence D Jackel, Yann LeCun, and Vladimir Vapnik. Boosting and other ensemble methods. Neural Computation, 6(6):1289–1301, 1994.
- Elsken et al. (2020) Thomas Elsken, Benedikt Staffler, Jan Hendrik Metzen, and Frank Hutter. Meta-learning of neural architectures for few-shot learning. In Proc. IEEE CVPR, pp. 12365–12375, 2020.
- Freitag et al. (2017) Markus Freitag, Yaser Al-Onaizan, and Baskaran Sankaran. Ensemble distillation for neural machine translation. arXiv preprint arXiv:1702.01802, 2017.
- Freund & Schapire (1997) Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
- Gontijo-Lopes et al. (2021) Raphael Gontijo-Lopes, Yann Dauphin, and Ekin D Cubuk. No one representation to rule them all: Overlapping features of training methods. arXiv preprint arXiv:2110.12899, 2021.
- Gou et al. (2021) Jianping Gou, Baosheng Yu, Stephen J Maybank, and Dacheng Tao. Knowledge distillation: A survey. International Journal of Computer Vision, 129(6):1789–1819, 2021.
- Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proc. ICML, pp. 1321–1330. PMLR, 2017.
- Ha et al. (2016) David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.
- Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, Jeff Dean, et al. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2(7), 2015.
- Hospedales et al. (2020) Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. Meta-learning in neural networks: A survey. arXiv preprint arXiv:2004.05439, 2020.
- Kim et al. (2018) Jangho Kim, SeongUk Park, and Nojun Kwak. Paraphrasing complex network: Network compression via factor transfer. Proc. NIPS, 31, 2018.
- Kim & Rush (2016) Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. In Proc. EMNLP, 2016.
- Knyazev et al. (2021) Boris Knyazev, Michal Drozdzal, Graham W Taylor, and Adriana Romero Soriano. Parameter prediction for unseen deep architectures. Proc. NIPS, 34, 2021.
- Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
- Lakshminarayanan et al. (2017) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Proc. NIPS, 30, 2017.
- Lévesque et al. (2016) Julien-Charles Lévesque, Christian Gagné, and Robert Sabourin. Bayesian hyperparameter optimization for ensemble learning. arXiv preprint arXiv:1605.06394, 2016.
- Li et al. (2019) Huaiyu Li, Weiming Dong, Xing Mei, Chongyang Ma, Feiyue Huang, and Bao-Gang Hu. Lgm-net: Learning to generate matching networks for few-shot learning. In Proc. ICML, pp. 3825–3834. PMLR, 2019.
- Lian et al. (2019) Dongze Lian, Yin Zheng, Yintao Xu, Yanxiong Lu, Leyu Lin, Peilin Zhao, Junzhou Huang, and Shenghua Gao. Towards fast adaptation of neural architectures with meta learning. In Proc. ICLR, 2019.
- Lin et al. (2020) Tao Lin, Lingjing Kong, Sebastian U Stich, and Martin Jaggi. Ensemble distillation for robust model fusion in federated learning. Proc. NIPS, 33:2351–2363, 2020.
- Liu et al. (2021a) Risheng Liu, Jiaxin Gao, Jin Zhang, Deyu Meng, and Zhouchen Lin. Investigating bi-level optimization for learning and vision from a unified perspective: A survey and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021a.
- Liu et al. (2019) Yufan Liu, Jiajiong Cao, Bing Li, Chunfeng Yuan, Weiming Hu, Yangxi Li, and Yunqiang Duan. Knowledge distillation via instance relationship graph. In Proc. IEEE CVPR, pp. 7096–7104, 2019.
- Liu et al. (2021b) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proc. IEEE CVPR, pp. 10012–10022, 2021b.
- Mahabadi et al. (2021) Rabeeh Karimi Mahabadi, Sebastian Ruder, Mostafa Dehghani, and James Henderson. Parameter-efficient multi-task fine-tuning for transformers via shared hypernetworks. In Proc. ACL, pp. 565–576, 2021.
- Malinin et al. (2019) Andrey Malinin, Bruno Mlodozeniec, and Mark Gales. Ensemble distribution distillation. In Proc. ICLR, 2019.
- Mendoza et al. (2016) Hector Mendoza, Aaron Klein, Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. Towards automatically-tuned neural networks. In Workshop on Automatic Machine Learning, pp. 58–65. PMLR, 2016.
- Opitz & Maclin (1999) David Opitz and Richard Maclin. Popular ensemble methods: An empirical study. Journal of artificial intelligence research, 11:169–198, 1999.
- Ovadia et al. (2019) Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, David Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. Proc. NIPS, 32, 2019.
- Park et al. (2021) Dae Young Park, Moon-Hyun Cha, Daesin Kim, Bohyung Han, et al. Learning student-friendly teacher networks for knowledge distillation. Proc. NIPS, 34:13292–13303, 2021.
- Park et al. (2019) Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In Proc. IEEE CVPR, pp. 3967–3976, 2019.
- Peng et al. (2019) Baoyun Peng, Xiao Jin, Jiaheng Liu, Dongsheng Li, Yichao Wu, Yu Liu, Shunfeng Zhou, and Zhaoning Zhang. Correlation congruence for knowledge distillation. In Proc. IEEE ICCV, pp. 5007–5016, 2019.
- Perrone & Cooper (1992) Michael P Perrone and Leon N Cooper. When networks disagree: Ensemble methods for hybrid neural networks. Technical report, Brown Univ Providence Ri Inst for Brain and Neural Systems, 1992.
- Polino et al. (2018) Antonio Polino, Razvan Pascanu, and Dan Alistarh. Model compression via distillation and quantization. In Proc. ICLR, 2018.
- Ratzlaff & Fuxin (2019) Neale Ratzlaff and Li Fuxin. Hypergan: A generative model for diverse, performant neural networks. In Proc. ICML, pp. 5361–5369. PMLR, 2019.
- Rokach (2010) Lior Rokach. Ensemble-based classifiers. Artificial intelligence review, 33(1):1–39, 2010.
- Rusu et al. (2018) Andrei A Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. In Proc. ICLR, 2018.
- Sagi & Rokach (2018) Omer Sagi and Lior Rokach. Ensemble learning: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4):e1249, 2018.
- Saikia et al. (2020) Tonmoy Saikia, Thomas Brox, and Cordelia Schmid. Optimized generic feature learning for few-shot classification across domains. arXiv preprint arXiv:2001.07926, 2020.
- Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Proc. NIPS, 30, 2017.
- Tay et al. (2020) Yi Tay, Zhe Zhao, Dara Bahri, Donald Metzler, and Da-Cheng Juan. Hypergrid transformers: Towards a single model for multiple tasks. In Proc. ICLR, 2020.
- Tian et al. (2019) Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. In Proc. ICLR, 2019.
- Tung & Mori (2019) Frederick Tung and Greg Mori. Similarity-preserving knowledge distillation. In Proc. IEEE CVPR, pp. 1365–1374, 2019.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proc. NIPS, pp. 5998–6008, 2017.
- Wang & Yoon (2021) Lin Wang and Kuk-Jin Yoon. Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
- Wenzel et al. (2020) Florian Wenzel, Jasper Snoek, Dustin Tran, and Rodolphe Jenatton. Hyperparameter ensembles for robustness and uncertainty quantification. Proc. NIPS, pp. 6514–6527, 2020.
- Wortsman et al. (2022) Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In Proc. ICML, pp. 23965–23998. PMLR, 2022.
- Ye & Ren (2021) Qinyuan Ye and Xiang Ren. Learning to generate task-specific adapters from task description. In Proc. ACL, pp. 646–653, 2021.
- Zhao et al. (2022) Borui Zhao, Quan Cui, Renjie Song, Yiyu Qiu, and Jiajun Liang. Decoupled knowledge distillation. In Proc. IEEE CVPR, pp. 11953–11962, 2022.
- Zhmoginov et al. (2022) Andrey Zhmoginov, Mark Sandler, and Maksym Vladymyrov. Hypertransformer: Model generation for supervised and semi-supervised few-shot learning. In Proc. ICML, pp. 27075–27098. PMLR, 2022.