Ensemble of Binary Classifiers Combined Using Recurrent Correlation Associative Memories

Rodolfo Anibal Lobo and Marcos Eduardo Valle Institute of Mathematics, Statistics, and Scientific Computing
University of Campinas
Campinas, Brazil
[email protected], [email protected]

Abstract

An ensemble method should cleverly combine a group of base classifiers to yield an improved classifier. The majority vote is an example of a methodology used to combine classifiers in an ensemble method. In this paper, we propose to combine classifiers using an associative memory model. Precisely, we introduce ensemble methods based on recurrent correlation associative memories (RCAMs) for binary classification problems. We show that an RCAM-based ensemble classifier can be viewed as a majority vote classifier whose weights depend on the similarity between the base classifiers and the resulting ensemble method. More precisely, the RCAM-based ensemble combines the classifiers using a recurrent consult and vote scheme. Furthermore, computational experiments confirm the potential application of the RCAM-based ensemble method for binary classification problems.

keywords:

Binary classification , ensemble method , associative memory , recurrent neural network , random forest.

1 Introduction

Inspired by the idea that multiple opinions are crucial before making a final decision, ensemble methods make predictions by consulting multiple different predictors [1]. Apart from their similarity with some natural decision-making methodologies, ensemble methods have a strong statistical background. Namely, ensemble methods aim to reduce the variance – thus increasing the accuracy – by combining multiple different predictors. Due to their versatility and effectiveness, ensemble methods have been successfully applied to a wide range of problems including classification, regression, and feature selection. As a preliminary study, this paper only addresses ensemble methods for binary classification problems.

Although there is no rigorous definition of an ensemble classifier [2], they can be conceived as a group of base classifiers, also called weak or base classifiers. As to the construction of an ensemble classifier, we must take into account the diversity of the base classifiers and the rule used to combine them [2, 3]. There are a plethora of ensemble methods in the literature, including bagging, pasting, random subspace, boosting, and stacking [4, 5, 6, 7]. For example, a bagging ensemble classifier is obtained by training copies of a single base classifier using different subsets of the training set [4]. Similarly, a random subspace classifier is obtained by training copies of a classifier using different subsets of features [5]. In both bagging and random subspace ensembles, the base classifiers are then combined using a voting scheme. Random forest is a successful example of an ensemble of decision tree classifiers trained using both bagging and random subspace ensemble ideas [8].

In contrast to the traditional majority voting, in this paper, we propose to combine the base classifiers using an associative memory. Associative memories (AMs) refer to a broad class of mathematical models inspired by the human brain’s ability to store and recall information by association [9, 10, 11]. The Hopfield neural network is a typical example of a recurrent neural network able to implement an associative memory [12]. Despite its many successful applications [13, 14, 15, 16], the Hopfield neural network suffers from an extremely low storage capacity as an associative memory model [17]. To overcome the low storage capacity of the Hopfield network, many prominent researchers proposed alternative learning schemes [18, 19] as well as improved network architectures. In particular, the recurrent correlation associative memories (RCAMs), proposed by Chiueh and Goodman [20], can be viewed as a kernelized version of the Hopfield neural network [21, 22, 23]. In this paper, we apply the RCAMs to combine binary classifiers in an ensemble method.

At this point, we would like to remark that associative memories have been previously used by Kultur et al. to improve the performance of an ensemble method [24]. Apart from addressing a regression problem, Kultur et al. use an associative memory in parallel to an ensemble of multi-layer perceptrons. The resulting model is called ensemble of neural networks with associative memory (ENNA). Our approach, in contrast, uses an associative memory to combine the base classifiers. Besides, Kultur et al. associate patterns using the k-nearest neighbor algorithm which is formally a non-parametric method used for classification or regression. Differently, we use recurrent correlation associative memories, which are models conceived to implement associative memories.

The paper is organized as follows: The next section reviews the recurrent correlation associative memories. Ensemble methods are presented in Section 3. The main contribution of the manuscript, namely the ensemble classifiers based on associative memories, are addressed in Section 3.2. Section 4 provides some computational experiments. The paper finishes with some concluding remarks in Section 5.

2 A Brief Review on Recurrent Correlation Associative Memories

Recurrent correlation associative memories (RCAMs) has been introduced by Chiueh and Goodman as an improved version of the famous correlation-based Hopfield neural network [20, 12].

Briefly, an RCAM is obtained by decomposing the Hopfield network with Hebbian learning into a two-layer recurrent neural network. The first layer computes the inner product (correlation) between the input and the memorized items followed by the evaluation of a non-decreasing continuous activation function. The subsequent layer yields a weighted average of the stored items.

In mathematical terms, a RCAM is defined as follows: Let $\mathbb{B}=\{-1,+1\}$ and $f:[-1,+1]\to\mathbb{R}$ be a continuous non-decreasing real-valued function. Given a fundamental memory set $\mathcal{U}=\{\boldsymbol{u}^{1},\ldots,\boldsymbol{u}^{P}\}\subset\mathbb{B}^{N}$ , the neurons in the first layer of a bipolar RCAM yield

w_{\xi}(t)=f\left(\frac{1}{N}\sum_{i=1}^{N}z_{i}(t)u_{i}^{\xi}\right),\quad\forall\xi\in 1,\ldots,P,

(1)

where $\boldsymbol{z}(t)=[z_{1}(t),z_{2}(t),\ldots,z_{N}(t)]^{T}\in\mathbb{B}^{N}$ denotes the current state of the network and $\boldsymbol{u}^{\xi}=[u_{1}^{\xi},\ldots,u_{N}^{\xi}]^{T}$ is the $\xi$ th fundamental memory. The activation potential of the output neuron $a_{i}(t)$ is given by the following weighted sum of the memory items:

a_{i}(t)=\sum_{\xi=1}^{P}w_{\xi}(t)u_{i}^{\xi},\quad\forall i=1,\ldots,N.

(2)

Finally, the state of the $i$ th neuron of the RCAM is updated as follows for all $i=1,\ldots,N$ :

z_{i}(t+1)=\begin{cases}\text{sgn}\big{(}a_{i}(t)\big{)}&a_{i}(t)\neq 0,\\ z_{i}(t),&\mbox{otherwise}.\end{cases}

(3)

From (2), we refer to $w_{\xi}(t)$ as the weight associated to the $\xi$ th memory item.

In contrast to the Hopfield neural network, the sequence $\{\boldsymbol{z}(t)\}_{t\geq 0}$ produced by an RCAM is convergent in both synchronous and asynchronous update modes independently of the number of fundamental memories and the initial state vector $\boldsymbol{z}(0)$ [20]. In other words, the limit $\boldsymbol{y}=\lim_{t\to\infty}\boldsymbol{z}(t+1)$ of the sequence given by (3) is well defined using either synchronous or asynchronous update.

As an associative memory model, an RCAM designed for the storage and recall of the vectors $\boldsymbol{u}^{1},\ldots,\boldsymbol{u}^{P}$ proceeds as follows: Given a stimulus (initial state) $\boldsymbol{z}(0)$ , the vector recalled by the RCAM is $\boldsymbol{y}=\lim_{t\to\infty}\boldsymbol{z}(t+1)$ .

Finally, the function $f$ defines different RCAM models. For example:

1.

The correlation RCAM or identity RCAM is obtained by considering in (1) the identity function $f_{i}(x)=x$ .
2.

The exponential RCAM, which is determined by

$f_{e}(x;\alpha)=e^{\alpha x},\quad\alpha>0.$ (4)

The identity RCAM corresponds to the traditional Hopfield network with Hebbian learning and self-feedback. Different from the Hopfield network and the identity RCAM, the storage capacity of the exponential RCAM scales exponentially with the dimension of the memory space. Apart from the high storage capacity, the exponential RCAM can be easily implemented on very large scale integration (VLSI) devices [20]. Furthermore, the exponential RCAM allows for a Bayesian interpretation [25] and it is closely related to support vector machines and the kernel trick [21, 22, 23]. In this paper, we focus on the exponential RCAM, formerly known as exponential correlation associative memory (ECAM).

3 Ensemble of Binary Classifiers

An ensemble classifier combines a group of single classifiers, also called weak or base classifiers, in order to provide better classification accuracy than a single one [1, 6, 2]. Although this approach is partially inspired by the idea that multiple opinions are crucial before making a final decision, ensemble classifiers have a strong statistical background. Namely, ensemble classifiers reduce the variance combining the base classifiers. Furthermore, when the amount of training data available is too small compared to the size of the hypothesis space, the ensemble classifier “ mixes” the base classifiers reducing the risk of choosing the wrong single classifier [26].

Formally, let $\mathcal{T}=\{(\boldsymbol{t}_{1},d_{1}),\dots,(\boldsymbol{t}_{M},d_{M})\}$ be a training set where $\boldsymbol{t}_{i}\in\mathcal{X}$ and $d_{i}\in\mathcal{C}$ are respectively the feature sample and the class label of the $i$ th training pair. Here, $\mathcal{X}$ denotes the feature space and $\mathcal{C}$ represents the set of all class labels. In a binary classification problem, we can identify $\mathcal{C}$ with $\mathbb{B}=\{-1,+1\}$ . Moreover, let $h_{1},h_{2},\ldots,h_{P}:\mathcal{X}\to\mathcal{C}$ be base classifiers trained using the whole or part of the training set $\mathcal{T}$ .

Usually, the base classifiers are chosen according to their accuracy and diversity. On the one hand, an accurate classifier is one that has an error rate better than random guessing on new instances. On the other hand, two classifiers are diverse if they make different errors on new instances [27, 26].

Bagging and random subspace ensembles are examples of techniques that can be used to ensure the diversity of the base classifiers. The idea of bagging, an acronym for Bootstrap AGGregatING, is to train copies of a certain classifier $h$ on subsets of the training set $\mathcal{T}$ [4]. The subsets are obtained by sampling the training $\mathcal{T}$ with replacement, a methodology known as bootstrap sampling [2]. In a similar fashion, random subspace ensembles are obtained by training copies of a certain classifier $h$ using different subsets of the feature space [5]. Random forest, which is defined as an ensemble of decision tree classifiers, is an example of an ensemble classifier that combines both bagging and random subspace techniques [8].

Another important issue that must be addressed in the design of an ensemble classifier is how to combine the base classifiers. In the following, we review the majority voting methodology – one of the oldest and widely used combination scheme. The methodology based on associative memories is introduced and discussed subsequently.

3.1 Majority Voting Classifier

As remarked by Kuncheva [2], majority voting is one of the oldest strategies for decision making. In a wide sense, a majority voting classifier yields the class label with the highest number of occurrences among the base classifiers [28, 7].

Formally, let $h_{1},h_{2},\ldots,h_{P}:\mathcal{X}\to\mathcal{C}$ be the base classifiers. The majority voting classifier, also called hard voting classifier and denoted by $H_{v}:\mathcal{X}\to\mathcal{C}$ , is defined by means of the equation

H_{v}(\boldsymbol{x})=\underset{c\in\mathcal{C}}{\operatorname{argmax}}\,\sum_{\xi=1}^{P}w_{\xi}\mathcal{I}[h_{\xi}(\boldsymbol{x})=c],\quad\forall\boldsymbol{x}\in\mathcal{X},

(5)

where $w_{1},\ldots,w_{P}$ are the weights of the base classifiers and $\mathcal{I}$ is the indicator function, that is,

\mathcal{I}[h_{\xi}(\boldsymbol{x})=c]=\begin{cases}1,&h_{\xi}(\boldsymbol{x})=c,\\ 0,&\mbox{otherwise}.\end{cases}

(6)

When $\mathcal{C}=\{-1,+1\}$ , the majority voting ensemble classifier given by (5) can be written alternatively as

H_{h}(\boldsymbol{x})=\text{sgn}\left(\sum_{\xi=1}^{P}w_{\xi}h_{\xi}(\boldsymbol{x})\right),\quad\forall\boldsymbol{x}\in\mathcal{X},

(7)

whenever $\sum_{\xi=1}^{P}w_{\xi}h_{\xi}(\boldsymbol{x})\neq 0$ [29].

3.2 Ensemble Based on Bipolar Associative Memories

Let us now introduce the ensemble classifiers based on the RCAM models. In analogy to the majority voting ensemble classifier, the RCAM-based ensemble classifier is formulated using only the base classifiers $h_{1},\ldots,h_{P}:\mathcal{X}\to\mathbb{B}$ . Precisely, consider a training set $\mathcal{T}=\{(\boldsymbol{t}_{i},d_{i}):i=1,\ldots,M\}\subset\mathcal{X}\times\mathbb{B}$ and let $X=\{\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{L}\}\subset\mathcal{X}$ be a batch of input samples. We first define the fundamental memories as follows for all $\xi=1,\ldots,P$ :

\boldsymbol{u}^{\xi}=[h_{\xi}(\boldsymbol{t}_{1}),\ldots,{h_{\xi}(\boldsymbol{t}_{M})},h_{\xi}(\boldsymbol{x}_{1}),\ldots,h_{\xi}(\boldsymbol{x}_{L})]^{T}\in\mathbb{B}^{M+L}.

(8)

In words, the $\xi$ th fundamental memory is obtained by concatenating the outputs of the $\xi$ th base classifier evaluated at the $M$ training samples and the $L$ input samples. The bipolar RCAM is synthesized using the fundamental memory set $\mathcal{U}=\{\boldsymbol{u}^{1},\ldots,\boldsymbol{u}^{P}\}$ and it is initialized at the state vector

\boldsymbol{z}(0)=[d_{1},d_{2},\ldots,d_{M},\underbrace{0,0,\ldots,0}_{L-\mbox{\small times}}]^{T}.

(9)

Note that the first $M$ components of initial state $\boldsymbol{z}(0)$ correspond to the targets in the training set $\mathcal{T}$ . The last $L$ components of $\boldsymbol{z}(0)$ are zero, a neutral element different from the class labels. The inital state $\boldsymbol{z}(0)$ is presented as input to the associative memory and the last $L$ components of the recalled vector $\boldsymbol{y}$ yield the class label of the batch of input samples $X=\{\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{L}\}$ . In mathematical terms, the RCAM-based ensemble classifier $H_{a}:\mathcal{X}\to\mathbb{B}$ is defined by means of the equation

\displaystyle H_{a}(\boldsymbol{x}_{i})=y_{M+i},\quad\forall\boldsymbol{x}_{i}\in X,

(10)

where $\boldsymbol{y}=[y_{1},\ldots,y_{M},y_{M+1},\ldots,y_{M+L}]^{T}$ is the limit of the sequence $\{\boldsymbol{z}(t)\}_{t\geq 0}$ given by (3).

In the following, we point out the relationship between the bipolar RCAM-based ensemble classifier and the majority voting ensemble described by (7). Let $\boldsymbol{y}$ be the vector recalled by the RCAM fed by the input $\boldsymbol{z}(0)$ given by (9), that is, $\boldsymbol{y}$ is a stationary state of the RCAM. From (2), (3), and (8), the output of the RCAM-based ensemble classifier satisfies

H_{a}(\boldsymbol{x}_{i})=\text{sgn}\left(\sum_{\xi=1}^{P}w_{\xi}h_{\xi}(\boldsymbol{x}_{i})\right),

(11)

where

w_{\xi}=f\left(\frac{1}{M+L}\sum_{i=1}^{M+L}y_{i}u_{i}^{\xi}\right),\quad\forall\xi=1,\ldots,P.

(12)

From (11), the bipolar RCAM-based ensemble classifier can be viewed as a weighted majority voting classifier. Furthermore, the weight $w_{\xi}$ depends on the similarity between the $\xi$ th base classifier $h_{\xi}$ and the ensemble classifier $H_{a}$ . Precisely, let us define the similarity between two binary classifiers $H,h_{\xi}:\mathcal{X}\to\mathbb{B}$ on a set of samples $S$ by means of the equation

\mathtt{Sim}(H,h)=\frac{1}{\mbox{Card}(S)}\sum_{\boldsymbol{s}\in S}\mathcal{I}\big{[}h(\boldsymbol{s})=H(\boldsymbol{s})\big{]}.

(13)

Using (13), we can state the following theorem:

Theorem 1.

The weights of the RCAM-based ensemble classifier given by (11) satisfies the following identities for all $\xi=1,\ldots,P$ :

w_{\xi}=f\big{(}1-2\cdot\mathtt{Sim}(H_{a},h_{\xi})\big{)},\quad\forall t\geq 1,

(14)

where the similarity in (14) is evaluated on the union of all training and input samples, that is, on $S=X\cup T=\{\boldsymbol{t}_{1},\ldots,\boldsymbol{t}_{M}\}\cup\{\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{L}\}$ .

Proof.

Since we are considering a binary classification problem, the similarity between the ensemble $H_{a}$ and the base classifier $h_{\xi}$ on $S=X\cup T$ , with $N=\text{Card}(S)=M+L$ , satisfies the following identities:

	$\displaystyle\mathtt{Sim}(H,h)$	$\displaystyle=1-\frac{1}{N}\sum_{i=1}^{N}\mathcal{I}[h(\boldsymbol{s}_{i})\neq H_{a}(\boldsymbol{s}_{i})]=1-\frac{1}{4N}\sum_{i=1}^{N}\big{(}h(\boldsymbol{s}_{i})-H_{a}(\boldsymbol{s}_{i})\big{)}^{2}$
		$\displaystyle=1-\frac{1}{2N}\sum_{i=1}^{N}\big{(}1-H_{a}(\boldsymbol{s}_{i})h(\boldsymbol{s}_{i})\big{)}=\frac{1}{2}\left(1-\frac{1}{N}\sum_{i=1}^{N}H_{a}(\boldsymbol{s}_{i})h(\boldsymbol{s}_{i})\right)$

Equivalently, we have

\frac{1}{\mbox{Card}(S)}\sum_{\boldsymbol{s}\in S}H(\boldsymbol{s})h(\boldsymbol{s})=1-2\cdot\mathtt{Sim}(H,h).

(15)

Now, from (1), (10), and (15), we obtain the following identities:

\displaystyle w_{\xi}

\displaystyle=f\left(\frac{1}{N}\sum_{i=1}^{N}y_{i}u_{i}^{\xi}\right)=f\big{(}1-2\cdot\mathtt{Sim}(H_{a},h_{\xi})\big{)},

which concludes the proof. ∎

Theorem 1 shows that the RCAM-based ensemble classifier is a majority voting classifier whose weights depend on the similarity between the base classifiers and the ensemble itself. In fact, in view of the dynamic nature of the RCAM model, $H_{a}$ is obtained by a recurrent consult and vote scheme. Moreover, at the first step, the weights depend on the accuracy of the base classifiers.

4 Computational Experiments

In this section, we perform some computational experiments to evaluate the performance of the proposed RCAM-based ensemble classifiers for binary classification tasks. Precisely, we considered the RCAM-based ensembles obtained using the identity and the exponential as the activation function $f$ . The parameter $\alpha$ of the exponential activation function has been either set to $\alpha=1$ or it has been determined using a grid search on the set $\{10^{-2},10^{-1},0.5,1,5,10,20,50\}$ with 5-fold cross-validation on the training set. The RCAM-based ensemble classifiers have been compared with AdaBoost, gradient boosting, and random forest ensemble classifiers, all available at the python’s scikit-learn API (sklearn) [30].

First of all, we trained AdaBoost and gradient boosting ensemble classifiers using the default parameters of sklearn. Recall that boosting ensemble classifiers are developed incrementally by adding base classifiers to reduce the number of misclassified samples [2]. Also, we trained the random forest classifier with 30 base classifiers ( $P=30$ ) [8]. Recall that the base classifiers of the random forest are decision trees obtained using bagging and random subspace techniques [4, 5]. Then, we used the base classifiers from the trained random forest ensemble to define the RCAM-based ensemble. In other words, the same base classifiers $h_{1},\ldots,h_{30}$ are used in the random forest and the RCAM-based classifiers. The difference between the ensemble classifiers resides in the combining rule. Recall that the random forest combines the base classifiers using majority voting. From the computational point of view, training the random forest and the RCAM-ensemble classifiers required similar resources. Moreover, despite the consult and vote scheme of the RCAM-based ensemble, they have not been significantly more expensive than the random forest classifier. The grid search used to fine-tune the parameter $\alpha$ of the exponential RCAM-based ensemble is the major computational burden in this computational experiment.

For the comparison of the ensemble classifiers, we considered 28 binary classification problems from the OpenML repository [31]. These binary classification problems can be obtained using the command fetch_openml from sklearn. We would like to point out that missing data has been handled before splitting the data set into training and test sets using the command SimpleImputer from sklearn. Also, we pre-processed the data using the StandardScaler transform. Therefore, each feature is normalized by subtracting the mean and dividing by the standard deviation, both computed using only the training set. Furthermore, since some data sets are unbalanced, we used the F-measure to evaluate quantitatively the performance of a certain classifier. Table 1 shows the mean and the standard deviation of the F-measure obtained from the ensemble classifiers using stratified 10-fold cross-validation. The largest F-measures for each data set have been typed using boldface.

Table 1: Mean and standard deviation of the F-measures produced by ensemble classifiers using stratified 10-fold cross-validation.

		Gradient	Random	Identity	Exponential	Exp. RCAM
Data set	AdaBoost	Boosting	Forest	RCNN	RCAM	+ Grid Search
Arsene	$84.0\pm 5.9$	$\mathbf{86.2\pm 7.6}$	$81.5\pm 8.9$	$83.8\pm 8.4$	$83.8\pm 8.4$	$85.2\pm 10.2$
Australian	$82.1\pm 3.4$	$\mathbf{85.8\pm 3.8}$	$85.4\pm 3.4$	$85.3\pm 2.9$	$85.3\pm 2.9$	$85.0\pm 2.9$
Banana	$67.9\pm 2.1$	$88.1\pm 1.6$	$88.0\pm 1.3$	$\mathbf{88.2\pm 1.2}$	$88.2\pm 1.2$	$87.2\pm 1.2$
Banknote	$\mathbf{99.6\pm 0.4}$	$99.5\pm 0.9$	$99.3\pm 0.7$	$99.2\pm 0.7$	$99.2\pm 0.7$	$98.9\pm 0.9$
Blood Transfusion	$\mathbf{43.0\pm 13.1}$	$37.9\pm 11.2$	$32.3\pm 10.4$	$33.3\pm 10.6$	$33.3\pm 10.6$	$32.5\pm 8.2$
Breast Cancer Wisconsin	$94.7\pm 2.0$	$95.2\pm 2.4$	$94.9\pm 3.4$	$\mathbf{95.4\pm 2.9}$	$95.1\pm 3.3$	$95.2\pm 4.2$
Chess	$96.5\pm 1.1$	$97.9\pm 0.8$	$99.0\pm 0.5$	$99.0\pm 0.6$	$99.0\pm 0.6$	$\mathbf{99.2\pm 0.4}$
Colic	$87.1\pm 6.4$	$86.7\pm 7.4$	$88.7\pm 5.7$	$88.6\pm 5.4$	$88.6\pm 5.4$	$\mathbf{88.9\pm 4.6}$
Credit Approval	$86.4\pm 2.9$	$86.9\pm 3.2$	$88.4\pm 2.8$	$\mathbf{88.4\pm 2.5}$	$\mathbf{88.4\pm 2.5}$	$88.3\pm 2.3$
Credit-g	$82.3\pm 2.5$	$84.2\pm 2.8$	$83.7\pm 2.4$	$\mathbf{84.3\pm 2.2}$	$\mathbf{84.3\pm 2.2}$	$83.9\pm 1.8$
Cylinder Bands	$78.3\pm 4.8$	$84.0\pm 4.8$	$83.0\pm 6.6$	$83.3\pm 6.4$	$83.3\pm 6.4$	$\mathbf{87.0\pm 4.2}$
Diabetes	$63.1\pm 5.2$	$65.1\pm 6.5$	$63.9\pm 8.8$	$\mathbf{65.6\pm 8.2}$	$\mathbf{65.6\pm 8.2}$	$62.4\pm 7.8$
Egg-Eye-State	$70.1\pm 1.3$	$78.0\pm 0.9$	$91.5\pm 0.7$	$91.8\pm 0.8$	$91.8\pm 0.8$	$\mathbf{92.9\pm 0.8}$
Haberman	$\mathbf{35.4\pm 9.5}$	$30.8\pm 14.2$	$27.4\pm 13.4$	$30.6\pm 9.6$	$30.6\pm 9.6$	$34.9\pm 12.9$
Hill-Valley	$40.9\pm 5.4$	$52.9\pm 7.3$	$54.9\pm 4.6$	$56.6\pm 3.8$	$56.6\pm 4.0$	$\mathbf{59.1\pm 6.2}$
Internet Advertisements	$98.0\pm 0.3$	$98.6\pm 0.3$	$\mathbf{98.8\pm 0.4}$	$98.7\pm 0.4$	$98.7\pm 0.4$	$98.7\pm 0.5$
Ionosphere	$94.3\pm 1.7$	$94.4\pm 2.0$	$94.2\pm 2.5$	$94.0\pm 2.5$	$94.0\pm 2.5$	$\mathbf{94.7\pm 2.7}$
MOFN-3-7-10	$\mathbf{100.0\pm 0.0}$	$\mathbf{100.0\pm 0.0}$	$99.8\pm 0.2$	$99.7\pm 0.3$	$99.7\pm 0.3$	$99.7\pm 0.5$
Monks-2	$0.0\pm 0.0$	$69.3\pm 8.7$	$93.1\pm 3.3$	$93.5\pm 3.3$	$93.5\pm 3.3$	$\mathbf{98.5\pm 2.7}$
Phoneme	$68.3\pm 3.0$	$75.4\pm 2.4$	$84.0\pm 3.0$	$84.1\pm 2.7$	$84.1\pm 2.7$	$\mathbf{85.7\pm 2.0}$
Pishing Websites	$94.4\pm 0.4$	$95.3\pm 0.5$	$97.5\pm 0.6$	$97.4\pm 0.6$	$97.4\pm 0.6$	$\mathbf{97.5\pm 0.5}$
Sick	$78.3\pm 6.4$	$88.8\pm 3.9$	$87.5\pm 3.1$	$88.6\pm 3.9$	$88.6\pm 3.9$	$\mathbf{89.7\pm 3.6}$
Sonar	$\mathbf{83.9\pm 8.0}$	$81.3\pm 6.2$	$81.9\pm 11.4$	$83.3\pm 11.1$	$83.3\pm 11.1$	$83.2\pm 11.1$
Spambase	$91.8\pm 1.5$	$93.1\pm 1.7$	$\mathbf{94.2\pm 1.1}$	$94.0\pm 1.2$	$94.1\pm 1.2$	$94.0\pm 1.2$
Steel Plates Fault	$\mathbf{100.0\pm 0.0}$	$\mathbf{100.0\pm 0.0}$	$99.0\pm 0.8$	$99.2\pm 0.6$	$99.2\pm 0.6$	$99.4\pm 0.7$
Tic-Tac-Toe	$84.5\pm 2.6$	$94.8\pm 2.1$	$95.6\pm 1.2$	$95.5\pm 1.2$	$95.5\pm 1.2$	$\mathbf{96.5\pm 1.5}$
Titanic	$\mathbf{58.8\pm 4.3}$	$53.8\pm 4.4$	$53.6\pm 4.2$	$53.6\pm 4.2$	$53.6\pm 4.2$	$53.8\pm 4.4$
ilpd	$\mathbf{41.4\pm 11.4}$	$35.3\pm 15.1$	$35.1\pm 15.8$	$37.5\pm 16.6$	$37.5\pm 16.6$	$33.5\pm 14.6$

Note the exponential RCAM-based ensemble classifier with grid search produced the largest F-measures in 11 of the 28 data sets. In particular, the exponential RCAM with grid search produced outstanding F-measures on the “Monks-2” and “Egg-Eye-State” data sets. For a better comparison of the ensemble classifiers, we followed Demšar’s recommendations to compare multiple classifier models using multiple data sets [32]. The Friedman test rejected the hypothesis that there is no difference between the ensemble classifiers.

A visual interpretation of the outcome of this computational experiment is provided in Figure 1 with the Hasse diagram of the non-parametric Wilcoxon signed-rank test with a confidence level at 95% [33, 34]. In this diagram, an edge means that the classifier on the top statistically outperformed the classifier on the bottom. The outcome of this analysis confirms that the RCAM-based ensemble classifiers statistically outperformed the other ensemble methods: AdaBoost, gradient boosting, and random forest.

Refer to caption — Figure 1: Hasse diagram of Wilcoxon signed-rank test with a confidence level at 95%.

As to the computational effort, Figure 2 shows the average time required by the ensemble classifiers for the prediction of a batch of testing samples. Note that the most expensive method is identity RCAM-based ensemble classifier while the gradient boosting is the cheapest. The exponential RCAM-based ensemble is less expensive than the AdaBoost and quite comparable to the random forest classifier.

Finally, note from Table 1 that some problems such as the “Banknote” and the “MOFN-3-7-10” data sets are quite easy while others such as the “Haberman” and “Hill Valey” are very hard. In order to circumvent the difficulties imposed by each data set, Figure 3 shows a box-plot with the normalized F-measure values provided in Table 1. Precisely, for each data set (i.e., each row in Table 1), we subtracted the mean and divided by the standard deviation of the score values.

The box-plot in Figure 3 confirms the good performance of the RCAM-based ensemble classifiers, including the exponential RCAM-based ensemble classifier with a grid search. Concluding, the boxplots shown on Figures 2 and 3 supports the potential application of the RCAM models as an ensemble of classifiers for binary classification problems.

5 Concluding Remarks

This paper provides a bridge between ensemble methods and associative memories. In general terms, an ensemble method reduces variance and improve the accuracy and robustness by combining a group of base predictors [6, 2]. The rule used to combine the base predictors is one important issue in the design of an ensemble method. In this paper, we propose to combine the base predictors using an associative memory. Associative memory is a model designed for the storage and recall of a set of vectors [11]. Furthermore, an associative memory should be able to retrieve a stored item from a corrupted or partial version of it. In an ensemble method, the memory model is designed for the storage of evaluations of the base classifiers. The associative memory is then fed by a vector with the target of training data as well as the unknown predictions. The output of the ensemble method is obtained from the vector retrieved by the memory.

Specifically, in this paper, we presented ensemble methods based on the recurrent correlation associative memories (RCAMs) for binary classifications. RCAMs, proposed by Chiueh and Goodman [20], are high storage capacity associative memories which, besides Bayesian and kernel trick interpretation, are particularly suited for VLSI implementation [25, 21, 22, 23]. Theorem 1 shows that the RCAM model yields a majority voting classifier whose weights are obtained by a recurrent consult and vote scheme. Moreover, the weights depend on the similarity between the base classifiers and the resulting ensemble. Computational experiments using decision tree as the base classifiers revealed an outstanding performance of the exponential RCAM-based ensemble classifier combined with a grid search strategy to fine-tune its parameter. The exponential RCAM-based ensemble, in particular, outperformed the traditional AdaBoost, gradient boosting, and random forest classifiers.

In the future, we plan to investigate further associative memory-based ensemble methods. In particular, we plan to extend these ensemble methods to multi-class classification problems using, for instance, multistate associative memory models [35, 36, 37, 38].

References

Ponti Jr [2011] M. P. Ponti Jr, Combining classifiers: from the creation of ensembles to the decision fusion, in: 2011 24th SIBGRAPI Conference on Graphics, Patterns, and Images Tutorials, IEEE, 2011, pp. 1–10.
Kuncheva [2014] L. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, 2 ed., John Wiley and Sons, 2014.
Polikar [2012] R. Polikar, Ensemble Learning, in: C. Zhang, Y. Ma (Eds.), Ensemble Machine Learning: Methods and Applications, Springer, 2012, pp. 1–34. doi:10.1007/978-1-4419-9326-7\_1.
Breiman [1996] L. Breiman, Bagging predictors, Machine Learning 24 (1996) 123–140. doi:10.1023/A:1018054314350.
Ho [1998] T. K. Ho, The random subspace method for constructing decision forests, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998) 832–844.
Zhang and Ma [2012] C. Zhang, Y. Ma (Eds.), Ensemble Machine Learning: Methods and Applications, Springer, 2012. doi:10.1007/978-1-4419-9326-7.
Géron [2019] A. Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, O’Reilly Media, 2019.
Breiman [2001] L. Breiman, Random forests, Machine Learning 45 (2001) 5–32. doi:10.1023/A:1010933404324.
Austin [1987] J. Austin, ADAM: A Distributed Associative Memory for Scene Analysis, in: Proceedings of the IEEE First International Conference on Neural Networks, volume IV, San Diego, 1987, p. 285.
Kohonen [1987] T. Kohonen, Self-organization and associative memory, 2rd edition ed., Springer-Verlag New York, Inc., New York, NY, USA, 1987.
Hassoun and Watta [1997] M. H. Hassoun, P. B. Watta, Associative Memory Networks, in: E. Fiesler, R. Beale (Eds.), Handbook of Neural Computation, Oxford University Press, 1997, pp. C1.3:1–C1.3:14.
Hopfield [1982] J. J. Hopfield, Neural Networks and Physical Systems with Emergent Collective Computational Abilities, Proceedings of the National Academy of Sciences 79 (1982) 2554–2558.
Hopfield and Tank [1985] J. Hopfield, D. Tank, Neural computation of decisions in optimization problems, Biological Cybernetics 52 (1985) 141–152.
Smith et al. [1998] K. Smith, M. Palaniswami, M. Krishnamoorthy, Neural Techniques for Combinatorial Optimization with Applications, IEEE Transactions on Neural Networks 9 (1998) 1301–1318.
Sun [2000] Y. Sun, Hopfield neural network based algorithms for image restoration and reconstruction. II. Performance analysis, IEEE Transactions on Signal Processing 48 (2000) 2119–2131. doi:10.1109/78.847795.
Serpen [2008] G. Serpen, Hopfield Network as Static Optimizer: Learning the Weights and Eliminating the Guesswork., Neural Processing Letters 27 (2008) 1–15. doi:10.1007/s11063-007-9055-8.
McEliece et al. [1987] R. J. McEliece, E. C. Posner, E. R. Rodemich, S. Venkatesh, The capacity of the Hopfield associative memory, IEEE Transactions on Information Theory 1 (1987) 33–45.
Kanter and Sompolinsky [1987] I. Kanter, H. Sompolinsky, Associative Recall of Memory without Errors, Physical Review 35 (1987) 380–392.
Müezzinoǧlu et al. [2005] M. Müezzinoǧlu, C. Güzelis, J. Zurada, An Energy Function-Based Design Method for Discrete Hopfield Associative Memory With Attractive Fixed Points, IEEE Transactions on Neural Networks 16 (2005) 370–378.
Chiueh and Goodman [1991] T. Chiueh, R. Goodman, Recurrent Correlation Associative Memories, IEEE Trans. on Neural Networks 2 (1991) 275–284.
García and Moreno [2004a] C. García, J. A. Moreno, The Hopfield Associative Memory Network: Improving Performance with the Kernel “Trick”, in: Lecture Notes in Artificial Inteligence - Proceedings of IBERAMIA 2004, volume 3315 of Advances in Artificial Intelligence – IBERAMIA 2004, Springer-Verlag, 2004a, pp. 871–880.
García and Moreno [2004b] C. García, J. A. Moreno, The Kernel Hopfield Memory Network, in: P. M. A. Sloot, B. Chopard, A. G. Hoekstra (Eds.), Cellular Automata, Springer Berlin Heidelberg, Berlin, Heidelberg, 2004b, pp. 755–764.
Perfetti and Ricci [2008] R. Perfetti, E. Ricci, Recurrent correlation associative memories: A feature space perspective, IEEE Transactions on Neural Networks 19 (2008) 333–345.
Kultur et al. [2009] Y. Kultur, B. Turhan, A. Bener, Ensemble of neural networks with associative memory (enna) for estimating software development costs, Knowledge-Based Systems 22 (2009) 395–402.
Hancock and Pelillo [1998] E. R. Hancock, M. Pelillo, A Bayesian interpretation for the exponential correlation associative memory, Pattern Recognition Letters 19 (1998) 149–159.
Kittler and Roli [2003] J. Kittler, F. Roli, Multiple Classifier Systems: First International Workshop, MCS 2000 Cagliari, Italy, June 21-23, 2000 Proceedings, Springer, 2003.
Hansen and Salamon [1990] L. K. Hansen, P. Salamon, Neural network ensembles, IEEE transactions on pattern analysis and machine intelligence 12 (1990) 993–1001.
Van Erp et al. [2002] M. Van Erp, L. Vuurpijl, L. Schomaker, An overview and comparison of voting methods for pattern recognition, in: Proceedings Eighth International Workshop on Frontiers in Handwriting Recognition, IEEE, 2002, pp. 195–200.
Ferreira and Figueiredo [2012] A. Ferreira, M. Figueiredo, Boosting Algorithms: A Review of Methods, Theory, and Applications, in: C. Zhang, Y. Ma (Eds.), Ensemble Machine Learning: Methods and Applications, Springer, 2012, pp. 35–85. doi:10.1007/978-1-4419-9326-7\_2.
Pedregosa et al. [2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
Vanschoren et al. [2013] J. Vanschoren, J. N. van Rijn, B. Bischl, L. Torgo, Openml: Networked science in machine learning, SIGKDD Explorations 15 (2013) 49–60. doi:10.1145/2641190.2641198.
Demšar [2006] J. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006) 1–30.
Burda [2013] M. Burda, paircompviz: An R Package for Visualization of Multiple Pairwise Comparison Test Results, 2013. doi:10.18129/B9.bioc.paircompviz.
Weise and Chiong [2015] T. Weise, R. Chiong, An alternative way of presenting statistical test results when evaluating the performance of stochastic approaches, Neurocomputing 147 (2015) 235–238. doi:10.1016/j.neucom.2014.06.071.
Jankowski et al. [1996] S. Jankowski, A. Lozowski, J. Zurada, Complex-Valued Multi-State Neural Associative Memory, IEEE Transactions on Neural Networks 7 (1996) 1491–1496.
Müezzinoǧlu et al. [2003] M. Müezzinoǧlu, C. Güzeliş, J. Zurada, A New Design Method for the Complex-Valued Multistate Hopfield Associative Memory, IEEE Transactions on Neural Networks 14 (2003) 891–899.
Minemoto et al. [2016] T. Minemoto, T. Isokawa, H. Nishimura, N. Matsui, Quaternionic multistate Hopfield neural network with extended projection rule, Artificial Life and Robotics 21 (2016) 106–111. doi:10.1007/s10015-015-0247-4.
Kobayashi [2017] M. Kobayashi, Quaternionic Hopfield neural networks with twin-multistate activation function, Neurocomputing 267 (2017) 304–310. doi:10.1016/j.neucom.2017.06.013.