
k-Winners-Take-All Ensemble Neural Network

Abien Fred Agarap    Arnulfo P. Azcarraga

College of Computer Studies
De La Salle University
2401 Taft Ave, Malate, Manila, 1004 Metro Manila, Philippines
Email: {abien_agarap, arnulfo.azcarraga}@dlsu.edu.ph
https://dlsu.edu.ph
Abstract

Ensembling is one approach that improves the performance of a neural network by combining a number of independent neural networks, usually by either averaging or summing up their individual outputs. We modify this ensembling approach by training the sub-networks concurrently instead of independently. This concurrent training of sub-networks leads them to cooperate with each other, and we refer to them as a “cooperative ensemble”. Meanwhile, the mixture-of-experts approach improves a neural network’s performance by dividing up a given dataset among its sub-networks. It then uses a gating network that assigns a specialization to each of its sub-networks, called “experts”. We improve on these aforementioned ways of combining a group of neural networks by using a k-Winners-Take-All (kWTA) activation function that acts as the combination method for the outputs of each sub-network in the ensemble. We refer to this proposed model as the “kWTA ensemble neural network” (kWTA-ENN). With the kWTA activation function, the losing neurons of the sub-networks are inhibited while the winning neurons are retained. This results in sub-networks having some form of specialization while also sharing knowledge with one another. We compare our approach with the cooperative ensemble and the mixture-of-experts, where we used a feed-forward neural network with one hidden layer of 100 neurons as the sub-network architecture. Our approach yields better performance than the baseline models, reaching the following test accuracies on benchmark datasets: 98.34% on MNIST, 88.06% on Fashion-MNIST, 91.56% on KMNIST, and 95.97% on WDBC.

Keywords:
Theory and algorithms · competitive learning · ensemble learning · mixture-of-experts · neural network models.

1 Introduction and Related Works

We use artificial neural networks in a myriad of automation tasks such as classification, regression, and translation, among others. Neural networks approach these tasks as a function approximation problem: given a dataset of input-output pairs $D=\{(x_{i},y_{i})\ |\ x_{i}\in\mathcal{X},\ y_{i}\in\mathcal{Y}\}$, their goal is to learn the mapping $\mathcal{F}:\mathcal{X}\mapsto\mathcal{Y}$. They accomplish this by optimizing their parameters $\theta$ with some modification mechanism, such as the back-propagation of output errors [14]. We then deem the parameters optimal if the neural network outputs are as close as possible to the target outputs in the training data, and if the network generalizes adequately to previously unseen data. This can be achieved when a network is neither too simple (has a high bias) nor too complex (has a high variance).
Through the years, combining a group of neural networks has been among the simplest and most straightforward ways to achieve this. The two basic ways to combine neural networks are ensembling [1, 4, 3, 15] and the mixture-of-experts (MoE) [5, 6]. In an ensemble, a group of independent neural networks is trained to learn the entire dataset. Meanwhile, in MoE, each network is trained to learn its own, different subset of the dataset.
In this work, we use a group of neural networks for a classification task on the following benchmark datasets: MNIST [8], Fashion-MNIST [18], Kuzushiji-MNIST (KMNIST) [2], and Wisconsin Diagnostic Breast Cancer (WDBC) [17]. We introduce a variant of ensemble neural networks that uses a $k$-Winners-Take-All (kWTA) activation function to combine the outputs of its sub-networks instead of using averaging, summation, or voting schemes to combine such outputs. We then compare our approach with an MoE and a modified ensemble network on a classification task on the aforementioned datasets.

1.1 Ensemble of Independent Networks

We usually form an ensemble of networks by independently or sequentially (in the case of boosting) training them, and then by combining their outputs at test time usually by averaging[1] or voting[4]. In this work, we opted to use the averaging scheme for ensembling.
That is, we have a group of neural networks $f_{1},\ldots,f_{M}$ parameterized by $\theta_{1},\ldots,\theta_{M}$, and we compute the ensemble's final output as,

$o=\frac{1}{M}\sum_{m=1}^{M}f_{m}(x;\theta_{m})$    (1)

Each sub-network is trained independently to minimize its own loss function, e.g. the cross entropy loss for classification, $\ell_{ce}(y,o)=-\sum y\log(o)$. Then Eq. 1 is used to get the model outputs at test time.
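As a concrete illustration, the following is a minimal PyTorch sketch of this traditional ensemble: each member is trained on its own cross entropy loss, and outputs are averaged only at test time (Eq. 1). The architecture mirrors the paper's sub-network (one hidden layer of 100 neurons), but all function names and sizes here are our own illustrative assumptions, not the authors' released code.

import torch
import torch.nn as nn

def make_subnetwork(in_dim=784, hidden=100, classes=10):
    # One-hidden-layer feed-forward sub-network, as used in the paper's experiments.
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, classes))

subnets = [make_subnetwork() for _ in range(3)]
criterion = nn.CrossEntropyLoss()

def independent_train_step(model, optimizer, x, y):
    # Each member minimizes its own loss, with no interaction among members.
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def ensemble_predict(x):
    # Eq. 1: average the individual (softmax) outputs at test time.
    probs = [torch.softmax(f(x), dim=-1) for f in subnets]
    return torch.stack(probs).mean(dim=0)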

1.2 Mixture of Experts

The Mixture-of-Experts (MoE) model consists of a set of $M$ “expert” neural networks $E_{1},\ldots,E_{M}$ and a “gating” neural network $G$ [5]. The experts are assigned by the gating network to handle their own subsets of the entire dataset. We compute the final output of this model using the following equation,

$o=\sum\arg\max G(x)\,E_{m}(x)$    (2)

where $G(x)$ is the gating probability output used to choose $E_{m}$ for a given input $x$. The gating network and the expert networks have their own respective sets of parameters.
Then, we compute the MoE model loss by using the following equation,

$\mathcal{L}_{MoE}(x,y)=\frac{1}{M}\sum_{m=1}^{M}\left[\frac{1}{n}\sum_{i=1}^{n}\arg\max G(x_{i})\cdot\ell_{ce}(y_{i},E_{m}(x_{i}))\right]$    (3)

where $\ell_{ce}$ is the cross entropy loss, and $G(x)$ is the weighting factor used to choose $E_{m}$.
In this system, each expert learns to specialize on the cases where it performs well, and it is made to ignore the cases on which it does not perform well. With this learning paradigm, each expert becomes a function of a sub-region of the data space, and so their learned weights differ highly from one another, as opposed to traditional ensemble models whose learners end up with almost identical weights.
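The sketch below illustrates a mixture-of-experts in the spirit of Eqs. 2-3. For differentiability, this version weights each expert by its gating probability; the hard arg-max selection written in the equations would correspond to a one-hot gate. The class name, layer sizes, and the soft-gating choice are our assumptions and not necessarily the authors' exact formulation.

import torch
import torch.nn as nn

class MixtureOfExperts(nn.Module):
    def __init__(self, in_dim=784, hidden=100, classes=10, n_experts=3):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, classes))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(in_dim, n_experts)

    def forward(self, x):
        g = torch.softmax(self.gate(x), dim=-1)                 # (B, M) gating probabilities
        outs = torch.stack([e(x) for e in self.experts], dim=1) # (B, M, classes)
        o = (g.unsqueeze(-1) * outs).sum(dim=1)                 # Eq. 2 with soft gating
        return o, g, outs

def moe_loss(y, g, outs):
    # Eq. 3 analogue: per-expert cross entropy weighted by the gate, averaged over experts.
    ce = nn.CrossEntropyLoss(reduction="none")
    losses = torch.stack([ce(outs[:, m], y) for m in range(outs.size(1))], dim=1)  # (B, M)
    return (g * losses).mean()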

1.3 Cooperative Ensemble Learning

We refer to the ensemble learning described in Section 1.1 as the traditional ensemble of independent neural networks. However, in our experiments, we trained the ensemble sub-networks concurrently instead of independently or sequentially. In Algorithm 1, we present our modified version of the traditional ensemble, which we call “cooperative ensemble” (CE) for the rest of this paper.

Input: Dataset $D=\{(x_{i},y_{i})\ |\ x_{i}\in\mathbb{R}^{d},\ y_{i}=1,\ldots,k\}$, randomly initialized networks $f_{1},\ldots,f_{M}$ parameterized by $\theta_{1},\ldots,\theta_{M}$
Output: Ensemble of $M$ trained networks $f_{1},\ldots,f_{M}$
1   Initialization;
2   Sample mini-batch $B\subset D$;
3   for $t\leftarrow 0$ to convergence do
4       for $m\leftarrow 1$ to $M$ do
5           # Forward pass: Compute model outputs for mini-batch
6           $\hat{y}_{m,1},\ldots,\hat{y}_{m,B}=f_{m}(x_{B})$
7       end for
8       $o=\frac{1}{M}\sum_{m}^{M}\hat{y}_{m}$
9       # Backward pass: Update the models
10      $\theta_{m}^{*}=\theta_{m}-\alpha\nabla\ell(y,o)$
11  end for
Algorithm 1: Cooperative Ensemble Learning

First, in a training loop, we compute each sub-network output $\hat{y}_{m,B}$ for mini-batches of data $B$ (line 6). Then, similar to a traditional ensemble, we compute the model output $o$ by averaging over the individual network outputs (line 8). Finally, we optimize the parameters of each sub-network in the ensemble based on the gradients of the loss between the ensemble output $o$ and the target labels $y$ (line 10).
In contrast, a traditional ensemble of independent networks trains each sub-network in isolation before ensembling, which prevents any interaction among the ensemble members and gives no opportunity for each member to contribute to the knowledge of the others.
Cooperative ensembling may have already been used in practice, but we highlight this variant because it serves as a more competitive baseline for our experimental model. Cooperative ensembling introduces some form of interaction among the sub-networks during training, since there is information feedback from the combination stage to the sub-network weights, giving each sub-network a chance to share its knowledge with the others [9].
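To make Algorithm 1 concrete, here is a minimal PyTorch sketch of one cooperative-ensemble training step: the sub-network outputs are averaged first, and a single loss on the averaged output updates every member. The per-member optimizers and all names here are illustrative assumptions on our part, not the authors' released implementation.

import torch
import torch.nn as nn

def cooperative_step(subnets, optimizers, x, y):
    criterion = nn.CrossEntropyLoss()
    for opt in optimizers:
        opt.zero_grad()
    # Forward pass of every member on the same mini-batch (Algorithm 1, line 6).
    outputs = [f(x) for f in subnets]
    # Combine by averaging (line 8), then back-propagate one shared loss (line 10).
    o = torch.stack(outputs).mean(dim=0)
    loss = criterion(o, y)
    loss.backward()
    for opt in optimizers:
        opt.step()
    return loss.item()

Each call to cooperative_step performs one mini-batch update for every sub-network, so the members influence one another through the shared loss on the averaged output.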
The contributions of this study are as follows,

  1. The conceptual introduction of cooperative ensembling as a modification to the traditional ensemble of independent networks. The cooperative ensemble is a competitive baseline model for our experimental model (see Section 3).

  2. We introduce an ensemble network that uses a kWTA activation function to combine its sub-network outputs (Section 2). Our approach yields better classification performance on the MNIST, Fashion-MNIST, KMNIST, and Wisconsin Diagnostic Breast Cancer (WDBC) datasets (see Section 3).

2 Competitive Ensemble Learning

We take the cooperative ensembling approach further by introducing a competitive layer as a way to combine the outputs of the sub-networks in the ensemble.
We propose to use a $k$-Winners-Take-All (kWTA) activation function on a fully connected layer that combines the sub-network outputs in the ensemble, and we call the resulting model the “kWTA ensemble neural network” (kWTA-ENN). As per Majani et al. (1989) [10], the kWTA activation function admits $k\geq 1$ winners in a competition among the neurons in a hidden layer of a neural network (see Eq. 4 for the kWTA activation function).

$\phi_{k}(z)_{j}=\begin{cases}z_{j}&z_{j}\in\{\max_{k}z\}\\ 0&z_{j}\notin\{\max_{k}z\}\end{cases}$    (4)

where $z$ is an activation output, and $k$ is the fraction of winning neurons we want to keep. We set $k=0.75$ in all our experiments, but it could still be optimized since it is a hyper-parameter. The kWTA activation function we used is the classical one [10]: we only inhibit the losing neurons in the competition while retaining the values of the winning neurons. Due to this competition, the winning neurons gain the right to respond to particular subsets of the input data, as per Rumelhart & Zipser (1985) [13].
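A minimal sketch of Eq. 4 in PyTorch follows, assuming, as in the text, that $k$ is given as a fraction of winning neurons ($k=0.75$ in our experiments): the top fraction of activations in each row is kept, and the rest are zeroed. The function name and the tie-handling at the threshold are our own illustrative choices.

import torch

def kwta(z: torch.Tensor, k: float = 0.75) -> torch.Tensor:
    # Number of winners as a fraction of the last dimension (at least one winner).
    n_winners = max(1, int(k * z.size(-1)))
    # Smallest winning value per row; everything below it is inhibited to zero.
    topk_vals, _ = torch.topk(z, n_winners, dim=-1)
    threshold = topk_vals[..., -1:].expand_as(z)
    return torch.where(z >= threshold, z, torch.zeros_like(z))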

Input: Dataset $D=\{(x_{i},y_{i})\ |\ x_{i}\in\mathbb{R}^{d},\ y_{i}=1,\ldots,k\}$, randomly initialized networks $f_{1},\ldots,f_{M}$ parameterized by $\theta_{1},\ldots,\theta_{M}$
Output: Ensemble of $M$ trained networks $f_{1},\ldots,f_{M}$
1   Initialization;
2   Sample mini-batch $B\subset D$;
3   for $t\leftarrow 0$ to convergence do
4       for $m\leftarrow 1$ to $M$ do
5           # Forward pass: Compute model outputs for mini-batch
6           $\hat{y}_{m,1},\ldots,\hat{y}_{m,B}=f_{m}(x_{B})$
7       end for
8       $\hat{Y}=\hat{y}_{1,B},\ldots,\hat{y}_{M,B}$
9       $z=\theta_{z}\hat{Y}+b_{z}$
10      $o=\phi_{k}(z)$
11      # Backward pass: Update the models
12      $\theta_{m}^{*}=\theta_{m}-\alpha\nabla\ell(y,o)$
13  end for
Algorithm 2: k-Winners-Take-All Ensemble Network

We have seen the training algorithm for our cooperative ensemble in Algorithm 1, wherein we train the sub-networks concurrently instead of independently or sequentially. We incorporate the same manner of training in kWTA-ENN, and we lay down our proposed training algorithm in Algorithm 2.
Our model first computes the sub-network outputs $f_{m}(x_{B})$ for each mini-batch of data $B$ (line 6), but as opposed to cooperative ensemble, we do not use a simple averaging of the sub-network outputs. Instead, we concatenate the sub-network outputs into $\hat{Y}$ (line 8) and use it as input to a fully connected layer (line 9). We then pass the fully connected layer output $z$ to the kWTA activation function (line 10). Finally, we update our ensemble based on the gradients of the loss between the kWTA-ENN outputs $o$ and the target labels $y$ (line 12).
To further probe the effect of the kWTA activation function on the combination of sub-network outputs, we add a competition delay parameter $d$. We define this delay parameter as the number of initial training epochs during which the kWTA activation function is not yet applied to the fully connected layer output that combines the sub-network outputs. We set $d=0, 3, 5, 7$.
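The following is a minimal PyTorch sketch of the kWTA-ENN forward pass in Algorithm 2: the sub-network outputs are concatenated, passed through a fully connected combination layer, and then through the kWTA activation (Eq. 4), with the delay parameter $d$ simply skipping kWTA for the first $d$ epochs. The kwta() helper from the earlier sketch, the class name, and the way the delay is threaded through the forward pass are all assumptions for illustration.

import torch
import torch.nn as nn

class KWTAEnsemble(nn.Module):
    def __init__(self, subnets, classes=10, k=0.75):
        super().__init__()
        self.subnets = nn.ModuleList(subnets)
        self.k = k
        # Combination layer: from concatenated sub-network logits to class scores.
        self.combine = nn.Linear(len(subnets) * classes, classes)

    def forward(self, x, epoch=0, delay=0):
        y_hat = torch.cat([f(x) for f in self.subnets], dim=-1)  # line 8: concatenation
        z = self.combine(y_hat)                                  # line 9: z = theta_z * Y_hat + b_z
        return z if epoch < delay else kwta(z, self.k)           # line 10: o = phi_k(z)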

3 Experiments

To demonstrate the performance gains from our approach, we used four benchmark datasets for evaluation: MNIST [8], Fashion-MNIST [18], KMNIST [2], and WDBC [17]. We ran each model ten times, and we report the average, best, and standard deviation of test accuracies for each of our models. Then, we ran a Kruskal-Wallis H test on the test accuracy results from the ten runs of the baseline and experimental models.

Table 1: Dataset statistics.
Dataset # Samples Input Dimension # Classes
MNIST 70,000 784 10
Fashion-MNIST 70,000 784 10
KMNIST 70,000 784 10
WDBC 569 30 2

3.1 Datasets Description

We evaluate and compare our baseline and experimental models on three benchmark image datasets and one benchmark diagnostic dataset. We list the dataset statistics in Table 1.

All of the *MNIST datasets consist of 60,000 training examples and 10,000 test examples each, all in grayscale with $28\times 28$ resolution. We flattened each image pixel matrix into a 784-dimensional vector.

  • MNIST. MNIST is a handwritten digit classification dataset [8].

  • Fashion-MNIST. Fashion-MNIST is said to be a more challenging alternative to MNIST, consisting of images of fashion articles from Zalando [18].

  • KMNIST. Kuzushiji-MNIST (KMNIST) is another alternative to the MNIST dataset. Each of its classes represents one character from each of the 10 rows of Hiragana [2].

  • WDBC. The WDBC dataset is a binary classification dataset whose 30-dimensional features were computed from a digitized image of a fine needle aspirate of a breast mass [17]. It consists of 569 samples, of which 212 are malignant and 357 are benign. We randomly over-sampled the minority class in the dataset to account for its imbalanced class frequency distribution, thus increasing the number of samples to 714. We then split this dataset into a 70% training set and a 30% test set.

We randomly picked 10% of the training samples of each dataset to serve as the validation set.
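As a sketch of the WDBC preparation described above, under stated assumptions, the snippet below randomly over-samples the minority (malignant) class and then performs a 70%/30% train-test split. The use of NumPy and scikit-learn here, as well as the variable names, are illustrative assumptions rather than the authors' exact pipeline.

import numpy as np
from sklearn.model_selection import train_test_split

def oversample_minority(features, labels, seed=42):
    # Duplicate randomly chosen minority-class samples until both classes are balanced.
    classes, counts = np.unique(labels, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    rng = np.random.default_rng(seed)
    idx = rng.choice(np.flatnonzero(labels == minority), size=deficit, replace=True)
    return np.concatenate([features, features[idx]]), np.concatenate([labels, labels[idx]])

# features: (569, 30) array, labels: (569,) binary array from the WDBC dataset.
# X, y = oversample_minority(features, labels)   # 569 samples -> 714 samples
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)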

3.2 Experimental Setup

The code implementations of both our baseline and experimental models are available at https://gitlab.com/afagarap/kwta-ensemble.

3.2.1 Hardware and Software Configuration

We trained all our models on a laptop computer with an Intel Core i5-6300HQ CPU and an Nvidia GTX 960M GPU. For reproducibility, we used the following arbitrarily chosen 10 random seeds: 42, 1234, 73, 1024, 86400, 31415, 2718, 30, 22, and 17. All our models were implemented in PyTorch 1.8.1 [11], with additional dependencies listed in the released source code.

Table 2: Classification results on the benchmark datasets (bold values represent the best results) in terms of average, best, and standard deviation of test accuracies (in %). Our kWTA-ENN achieves better test accuracies than our baseline models with statistical significance. * denotes significance at $p<0.05$; ns denotes not significant.
MNIST
# nets Acc kWTA-ENN MoE CE
d = 0 d = 3 d = 5 d = 7
2 AVG 98.16 98.18 98.18 98.18 96.43 97.90
MAX 98.28 98.28 98.28 98.28 96.66 97.96
STD 0.08 0.08 0.08 0.08 0.25 0.05
* $H=41.51,\ p=7.39\times 10^{-8}$
3 AVG 98.24 98.26 98.26 98.26 94.67 97.62
MAX 98.36 98.39 98.39 98.39 96.33 97.71
STD 0.06 0.08 0.08 0.08 0.99 0.05
* $H=41.19,\ p=8.61\times 10^{-8}$
4 AVG 98.30 98.27 98.27 98.27 92.349 97.33
MAX 98.43 98.39 98.39 98.39 95.02 97.39
STD 0.07 0.08 0.08 0.08 1.30 0.05
* $H=41.60,\ p=7.11\times 10^{-8}$
5 AVG 98.33 98.34 98.34 98.34 90.63 97.02
MAX 98.52 98.42 98.42 98.42 91.94 97.13
STD 0.08 0.05 0.05 0.05 1.25 0.06
* $H=41.58,\ p=7.17\times 10^{-8}$
Fashion-MNIST
# nets Acc kWTA-ENN MoE CE
d = 0 d = 3 d = 5 d = 7
2 AVG 87.53 87.54 87.54 87.54 86.59 87.84
MAX 87.78 87.70 87.70 87.70 87.54 88.00
STD 0.16 0.12 0.12 0.12 0.40 0.11
* $H=36.75,\ p=6.72\times 10^{-7}$
3 AVG 87.73 87.81 87.81 87.81 85.54 87.69
MAX 88.01 88.10 88.10 88.10 87.15 87.86
STD 0.18 0.15 0.15 0.15 0.58 0.09
* $H=28.32,\ p=3.16\times 10^{-5}$
4 AVG 87.88 87.93 87.93 87.93 84.47 87.40
MAX 88.22 88.15 88.15 88.15 86.69 87.54
STD 0.14 0.15 0.15 0.15 1.20 0.09
* $H=42.04,\ p=5.78\times 10^{-8}$
5 AVG 87.99 88.06 88.06 88.06 82.89 87.15
MAX 88.22 88.27 88.27 88.27 85.80 87.27
STD 0.15 0.15 0.15 0.15 2.18 0.05
* $H=42.26,\ p=5.22\times 10^{-8}$
KMNIST
# nets Acc kWTA-ENN MoE CE
d = 0 d = 3 d = 5 d = 7
2 AVG 90.64 90.53 90.53 90.53 85.23 89.94
MAX 91.11 90.74 90.74 90.74 87.14 90.19
STD 0.29 0.12 0.12 0.12 0.99 0.12
* $H=41.63,\ p=6.99\times 10^{-8}$
3 AVG 91.16 91.17 91.17 91.17 81.12 89.47
MAX 91.4 91.51 91.51 91.51 87.59 89.61
STD 0.14 0.19 0.19 0.19 2.77 0.12
* $H=41.09,\ p=9.00\times 10^{-8}$
4 AVG 91.39 91.31 91.31 91.31 77.55 88.72
MAX 91.68 91.54 91.54 91.54 83.04 88.94
STD 0.18 0.15 0.15 0.15 2.89 0.13
* $H=41.67,\ p=6.88\times 10^{-8}$
5 AVG 91.56 91.52 91.52 91.52 74.17 87.87
MAX 91.82 91.76 91.76 91.76 79.99 88.02
STD 0.16 0.18 0.18 0.18 3.47 0.09
* $H=41.28,\ p=8.24\times 10^{-8}$
WDBC
# nets Acc kWTA-ENN MoE CE
d = 0 d = 3 d = 5 d = 7
2 AVG 95.43 95.36 95.36 95.36 94.49 95.79
MAX 98.62 98.62 98.62 98.62 98.57 99.05
STD 1.98 2.48 2.48 2.48 2.37 2.13
(ns) $H=1.40,\ p=9.24\times 10^{-1}$
3 AVG 94.76 95.64 95.64 95.64 92.68 95.35
MAX 98.15 99.07 99.07 99.07 95.45 98.17
STD 1.92 2.33 2.33 2.33 2.36 2.45
(ns) $H=9.20,\ p=1.02\times 10^{-1}$
4 AVG 94.98 95.97 95.97 95.97 91.79 95.65
MAX 98.62 98.62 98.62 98.62 96.67 98.15
STD 2.87 2.20 2.20 2.20 4.20 2.00
* $H=12.56,\ p=2.78\times 10^{-2}$
5 AVG 95.03 95.40 95.40 95.40 90.93 95.04
MAX 98.61 99.05 99.05 99.05 96.33 98.61
STD 2.73 2.83 2.83 2.83 2.60 2.47
* $H=12.16,\ p=3.27\times 10^{-2}$

3.2.2 Training Details

For all our models, we used a feed-forward neural network with one hidden layer of 100 neurons as the sub-network, and we varied the number of sub-networks per model from 2 to 5. The sub-network weights were initialized with the Kaiming uniform initializer [7].
We trained our baseline and experimental models on the MNIST, Fashion-MNIST, and KMNIST datasets using mini-batch stochastic gradient descent (SGD) with momentum [12] of $9\times 10^{-1}$, a learning rate of $1\times 10^{-1}$ decaying to $1\times 10^{-4}$, and weight decay of $1\times 10^{-5}$, on a batch size of 100 for 10,800 iterations (equivalent to 20 epochs). As for the WDBC dataset, we used the same hyper-parameters except that we trained our models for only 249 iterations (equivalent to 20 epochs). All these hyper-parameters were arbitrarily chosen, since we did not perform hyper-parameter tuning for any of our models. This keeps the comparison fair between our baseline and experimental models; we also did not have the computational resources for tuning, which is why we chose a simple architecture as the sub-network.
We recorded the accuracy and loss during both the training and validation phases. We then used the validation accuracy as the basis to checkpoint the best model parameters θ\theta so far in the training. By the end of each training epoch, we load the best recorded parameters to be used by the model at test time.
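A minimal sketch of this training configuration is given below: mini-batch SGD with momentum 0.9, a learning rate of 1e-1 decayed toward 1e-4, and weight decay of 1e-5, plus validation-based checkpointing. The choice of a cosine-annealing scheduler is our assumption; the paper only states that the learning rate decays from 1e-1 to 1e-4.

import torch

def make_optimizer(model, epochs=20):
    # SGD with momentum, learning rate, and weight decay as described above.
    optimizer = torch.optim.SGD(
        model.parameters(), lr=1e-1, momentum=9e-1, weight_decay=1e-5
    )
    # Assumed decay schedule from 1e-1 down to 1e-4 over the training epochs.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs, eta_min=1e-4
    )
    return optimizer, scheduler

# Validation-based checkpointing as described: keep the parameters with the best
# validation accuracy seen so far, then reload them before testing.
# if val_acc > best_val_acc:
#     best_val_acc = val_acc
#     torch.save(model.state_dict(), "best.pt")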

3.3 Classification Performance

We evaluate the performance of our proposed approach in its different configurations of the competition delay parameter $d$, and compare it with our baseline models: Mixture-of-Experts (MoE) and Cooperative Ensemble (CE). The empirical evidence shows that our proposed approach outperforms the baseline models on the benchmark datasets we used. However, we were not able to observe a clear trend in performance with respect to the varying values of $d$, which may warrant further investigation.
For the full classification performance results of our baseline and experimental models, we refer the reader to Table 2, from which we can observe the following:

  1. MoE performed the worst among the models in our experiments, which may be explained by our choice of a mini-batch size of 100, since MoE performs better on larger datasets and/or larger batch sizes [5, 16].

  2. CE is indeed a competitive baseline, as we can see from the performance margins when it is compared with our proposed model.

  3. Our model, in its different variations, consistently outperformed our baseline models in terms of average test accuracy (with the exception of two sub-networks for Fashion-MNIST and WDBC).

  4. Our model has higher margins of improvement in test accuracy on the KMNIST dataset, which we find appealing since this dataset is also supposed to be more difficult than MNIST, and it therefore better demonstrates the performance gains from our model.

  5. Finally, we observe statistical significance in the differences in performance between the baseline and experimental models at $p<0.05$ (on WDBC, only for $M=4,5$ sub-networks), which indicates that the performance gains from our proposed approach are statistically significant.

[Figure 1 panels: (a) MoE, (b) Cooperative Ensemble (CE), (c) kWTA-ENN]
Figure 1: Predictions of each sub-network on a sample MNIST data point and their respective final outputs. In 1(a), we can infer that MoE sub-networks 2 and 3 are specializing on class 1. In 1(b), all CE sub-networks have high probability outputs for class 1. In 1(c), all kWTA-ENN sub-networks contributed, but with the kWTA activation function the neurons for other classes were most likely inhibited at inference, hence its higher probability output compared to MoE and CE.
[Figure 2 panels: (a) MoE, (b) Cooperative Ensemble (CE), (c) kWTA-ENN]
Figure 2: Predictions of each sub-network on a sample KMNIST data point and their respective final outputs. In 2(a), we can infer that MoE sub-network 2 is specializing on class 6 (“ma”). In 2(b), CE sub-network 3 was assisted by sub-network 2. In 2(c), all kWTA-ENN sub-networks contributed, but with the kWTA activation function the neurons for other classes were most likely inhibited at inference, hence its higher probability output compared to MoE and CE.

3.4 Improving cooperation through competitive learning

In the context of our work, we refer to cooperation in a group of neural networks as the phenomenon in which the members of the group contribute to the overall group performance. For instance, in CE, all the sub-networks contribute to the knowledge of one another, as opposed to the traditional ensemble, where there is no interaction among the ensemble members [9]. Meanwhile, specialization occurs when members of a group of neural networks are assigned to a specific subset of the input data, which is the intention behind the design of MoE [5]. In this respect, competition can be thought of as leading to specialization, since the winning units gain the right to respond to a particular subset of the dataset. We argue that our proposed approach employs all three notions: competition, specialization, and cooperation.

Table 3: Classification results of each kWTA-ENN sub-network and of kWTA-ENN itself on the MNIST (3(a)) and KMNIST (3(b)) datasets. The tables show the test accuracy of each sub-network on each dataset class, indicating a degree of specialization among the sub-networks. Furthermore, the final model accuracy on each class shows that combining the sub-network outputs has stronger predictive capability. These divisions were in no way pre-determined, but they show how cooperation by specialization can be achieved through a competitive ensemble.
(a) MNIST
Class | n = 2 (Net-1, Net-2, Final) | n = 3 (Net-1, Net-2, Net-3, Final) | n = 4 (Net-1, Net-2, Net-3, Net-4, Final) | n = 5 (Net-1, Net-2, Net-3, Net-4, Net-5, Final)
0 | 93.65 93.07 98.38 | 96.44 90.38 94.49 98.78 | 71.55 84.36 99.56 69.15 98.68 | 96.10 93.11 85.83 94.91 75.07 99.29
1 | 95.97 85.42 99.03 | 65.52 96.13 75.91 99.30 | 54.44 91.07 100.00 52.46 99.47 | 87.96 71.03 NaN 58.35 71.60 99.03
2 | 84.15 87.29 98.55 | 62.92 94.89 83.01 98.26 | 82.80 74.53 35.96 77.51 97.88 | 75.66 56.49 37.32 80.63 95.86 98.44
3 | 80.95 55.70 97.64 | 45.22 90.98 91.16 98.42 | 36.96 63.28 70.62 62.89 98.32 | 38.42 84.77 37.93 44.11 82.18 97.85
4 | 87.01 88.22 98.47 | 53.23 87.55 72.75 98.27 | 76.58 42.58 68.86 80.73 98.07 | 53.01 81.43 62.47 82.61 51.95 97.97
5 | 88.11 85.79 98.87 | 92.61 69.26 67.32 98.65 | 59.88 56.59 68.49 70.17 98.87 | 40.99 78.57 50.75 25.00 53.43 99.20
6 | 97.50 93.12 98.43 | 83.32 93.76 90.21 98.85 | 91.64 50.00 78.41 82.21 98.53 | 84.51 76.62 90.84 93.66 74.83 98.54
7 | 74.34 82.96 98.06 | 85.11 74.32 90.80 98.25 | 60.91 72.37 57.07 48.87 98.64 | 76.30 78.04 76.23 72.87 50.41 98.73
8 | 70.23 93.52 97.44 | 57.64 77.35 72.58 97.44 | 70.26 50.69 51.66 54.81 97.64 | 30.41 34.74 62.99 33.42 74.02 97.64
9 | 95.61 79.95 97.80 | 51.96 72.86 45.49 97.42 | 59.16 32.77 81.57 74.63 98.01 | 49.32 88.89 34.84 59.78 93.09 97.35
(b) KMNIST
Class | n = 2 (Net-1, Net-2, Final) | n = 3 (Net-1, Net-2, Net-3, Final) | n = 4 (Net-1, Net-2, Net-3, Net-4, Final) | n = 5 (Net-1, Net-2, Net-3, Net-4, Net-5, Final)
o | 76.26 92.15 92.61 | 66.90 77.87 57.51 93.36 | 75.75 86.90 81.82 47.54 93.47 | 89.81 89.66 57.80 66.26 46.32 93.89
ki | 73.97 84.29 91.93 | 36.16 66.43 64.12 90.91 | 50.00 44.32 42.90 58.70 90.62 | 32.40 44.29 42.84 59.19 33.91 91.58
su | 48.96 57.00 86.77 | 64.27 63.46 43.32 86.82 | 30.37 44.03 48.54 46.96 86.40 | 44.73 35.02 34.72 49.94 29.66 85.88
tsu | 62.89 68.20 90.31 | 81.94 72.38 79.35 92.96 | 56.40 49.27 58.38 74.32 93.52 | 70.08 51.31 81.44 81.71 53.56 92.85
na | 67.75 58.87 88.86 | 61.84 76.86 51.20 90.96 | 54.63 78.43 44.91 72.18 90.45 | 66.21 50.35 39.13 48.31 54.84 91.38
ha | 78.82 85.22 95.81 | 70.00 85.87 85.66 96.11 | 50.40 60.07 92.27 63.64 95.80 | 52.29 57.13 81.25 53.85 78.43 95.51
ma | 79.54 73.49 88.42 | 47.10 61.70 65.72 86.67 | 77.54 43.15 69.14 55.65 88.95 | 41.11 56.41 48.51 40.66 37.87 87.75
ya | 69.87 57.35 93.95 | 78.68 78.20 78.64 94.36 | 68.37 56.80 70.08 73.91 94.68 | 55.13 54.11 58.63 70.67 57.61 94.62
re | 72.54 83.40 90.61 | 79.91 58.84 63.71 90.89 | 40.09 78.33 70.06 44.14 89.86 | 45.38 49.66 42.62 43.62 49.06 91.00
wo | 68.23 71.51 94.10 | 76.15 40.68 56.52 92.71 | 65.73 81.18 36.86 63.54 92.88 | 58.00 75.00 55.14 48.84 68.35 94.15

kWTA-ENN uses a kWTA activation function so that the neurons from its sub-networks can compete for the right to respond to a particular subset of the dataset. We demonstrate this in Figures 1 and 2. Recall that kWTA-ENN obtains its outputs by computing a linear combination of the outputs of its sub-networks and then passing the result to a kWTA activation function. As the figures show, even though each kWTA-ENN sub-network does not provide a high probability output per class compared to the MoE and CE sub-networks, the final kWTA-ENN output is on par with the MoE and CE probability outputs.
We can then infer two things from this: (1) the kWTA activation function inhibits the neurons of the losing kWTA-ENN sub-networks, and (2) the probability outputs of the winning sub-network neurons enable the sub-networks to help one another. For instance, in Figure 1(c), we can observe a minimal probability output for class 1 from sub-network 1 and higher probability outputs for class 1 from sub-networks 2 and 3, yet the final output has an even higher probability for the same class when compared to the MoE and CE probability outputs. The same can be observed in Figure 2(c). This is because the losing neurons are inhibited in the competition process while the winning neurons are retained, thus improving the final probability output of the model.
In Table 3, we further support this by showing the per-class accuracy of each kWTA-ENN sub-network with a varying number of sub-networks. We can see some apparent division of classes among the sub-networks even without pre-defining such divisions, yet the final per-class accuracies of the entire model are even better than those of the individual sub-networks. This suggests that there is indeed a sharing of responsibility among the sub-networks, due to the inhibition of the losing sub-network neurons and the retention of the winning ones, even with the competition in place.

4 Conclusion and Future Works

We introduced the k-Winners-Take-All ensemble neural network (kWTA-ENN), which uses a kWTA activation function as the means to combine the sub-network outputs in an ensemble, as opposed to the conventional combination of sub-network outputs through averaging, summation, or voting. Using a kWTA activation function induces competition among the sub-network neurons in an ensemble. This in turn leads to some form of specialization among them, thereby improving the overall performance of the ensemble.
Our comparative results showed that our proposed approach outperforms our baseline models, yielding the following test accuracies on benchmark datasets: 98.34% on MNIST, 88.06% on Fashion-MNIST, 91.56% on KMNIST, and 95.97% on WDBC. We intend to pursue further exploration into this subject by comparing the performance of our baseline and experimental models with respect to varying mini-batch sizes, by training on other benchmark datasets, and finally, by using a more rigorous statistical treatment for a more formal comparison between our proposed model and our baseline models.

References

  • [1] Breiman, Leo. “Stacked regressions.” Machine Learning 24.1 (1996): 49-64.
  • [2] Clanuwat, Tarin, et al. “Deep Learning for Classical Japanese Literature.” arXiv preprint arXiv:1812.01718 (2018).
  • [3] Freund, Yoav, and Robert E. Schapire. “Experiments with a New Boosting Algorithm.” ICML. Vol. 96. 1996.
  • [4] Hansen, Lars Kai, and Peter Salamon. “Neural network ensembles.” IEEE Transactions on Pattern Analysis and Machine Intelligence 12.10 (1990): 993-1001.
  • [5] Jacobs, Robert A., et al. “Adaptive Mixtures of Local Experts.” Neural Computation 3.1 (1991): 79-87.
  • [6] Jordan, Michael I., and Robert A. Jacobs. “Hierarchies of adaptive experts.” Advances in Neural Information Processing Systems. 1992.
  • [7] He, Kaiming, et al. “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification.” Proceedings of the IEEE international conference on computer vision. 2015.
  • [8] LeCun, Yann. “The MNIST database of handwritten digits.” http://yann.lecun.com/exdb/mnist/ (1998).
  • [9] Liu, Yong, and Xin Yao. “A cooperative ensemble learning system.” 1998 IEEE International Joint Conference on Neural Networks Proceedings. IEEE World Congress on Computational Intelligence (Cat. No. 98CH36227). Vol. 3. IEEE, 1998.
  • [10] Majani, E., Ruth Erlanson, and Yaser Abu-Mostafa. “On the K-winners-take-all network.” (1989): 634-642.
  • [11] Paszke, Adam, et al. “PyTorch: An Imperative Style, High-Performance Deep Learning Library.” Advances in Neural Information Processing Systems 32 (2019): 8026-8037.
  • [12] Qian, Ning. “On the momentum term in gradient descent learning algorithms.” Neural Networks 12.1 (1999): 145-151.
  • [13] Rumelhart, David E., and David Zipser. “Feature Discovery by Competitive Learning.” Cognitive Science 9.1 (1985): 75-112.
  • [14] Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. “Learning representations by back-propagating errors.” Nature 323.6088 (1986): 533-536.
  • [15] Schapire, Robert E. “The strength of weak learnability.” Machine Learning 5.2 (1990): 197-227.
  • [16] Shazeer, Noam, et al. “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.” arXiv preprint arXiv:1701.06538 (2017).
  • [17] Wolberg, William H., W. Nick Street, and Olvi L. Mangasarian. “Breast Cancer Wisconsin (Diagnostic) Data Set.” UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/ (1992).
  • [18] Xiao, Han, Kashif Rasul, and Roland Vollgraf. “Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms.” arXiv preprint arXiv:1708.07747 (2017).