
Acceleration of Subspace Learning Machine via Particle Swarm Optimization and Parallel Processing

Hongyu Fu^1, Yijing Yang^1, Yuhuai Liu^1, Joseph Lin^1, Ethan Harrison^3, Vinod K. Mishra^2 and C.-C. Jay Kuo^1
^1 University of Southern California, Los Angeles, California, USA
^2 Army Research Laboratory, Aberdeen Proving Ground, Maryland, USA
^3 Mira Costa High School, Manhattan Beach, California, USA
Abstract

Built upon the decision tree (DT) classification and regression idea, the subspace learning machine (SLM) has been recently proposed to offer higher performance in general classification and regression tasks. This performance improvement comes at the expense of higher computational complexity. In this work, we investigate two ways to accelerate SLM. First, we adopt the particle swarm optimization (PSO) algorithm to speed up the search for a discriminant dimension that is expressed as a linear combination of current dimensions. The search for the optimal weights of the linear combination is computationally heavy and is accomplished by probabilistic search in the original SLM. The acceleration of SLM by PSO requires 10-20 times fewer iterations. Second, we leverage parallel processing in the SLM implementation. Experimental results show that the accelerated SLM method achieves a speed-up factor of 577 in training time while maintaining comparable classification/regression performance to the original SLM.

1 Introduction

Feature-based classification and regression methods, where feature extraction and decision-making are two separate modules, have been well studied for general classification and regression tasks for decades in the field of machine learning. In recent years, deep learning (DL) methods have handled feature extraction and decision-making jointly by training deep neural networks (DNNs) through backpropagation (BP). In this work, we adopt the traditional machine learning pipeline and aim to improve the high-performance decision-making (i.e., classification and regression) module with features as the input.

Built upon the classical decision tree (DT) [1] idea, SLM has been recently proposed in [2]. SLM offers a high-performance machine learning solution to general classification and regression tasks. Both DT and SLM convert the high-dimensional feature space splitting problem into a 1D feature space splitting problem. DT examines each feature in the raw feature space, chooses the one that meets a certain criterion, and searches for the optimal split point (i.e., the threshold value). In contrast, SLM forms a linear combination of existing features to obtain a new discriminant 1D space and finds the optimal split point in it. This demands finding the optimal weights of the linear combination in the training stage, which is computationally heavy.

It was shown in [2] that SLM achieves better performance than the support vector machine (SVM) [3], DT [1], and multilayer perceptron (MLP) [4, 5] with fewer parameters on several benchmarking machine learning datasets. The performance improvement is reached at the expense of higher computational complexity. In this work, we investigate two ideas to accelerate SLM: 1) a more efficient weight learning strategy via particle swarm optimization (PSO) [6], and 2) software implementation optimization with parallel processing on CPU and GPU.

For the first idea, the search for optimal weights is accomplished by probabilistic search in the original SLM. No optimization is involved. The probabilistic search demands a large number of iterations to achieve desired results and is difficult to scale to a high-dimensional feature space. In contrast, PSO simulates the behavior of a flock of birds, where each bird is modeled by a particle. PSO finds the optimal solution in the search space by updating particles jointly and iteratively. The acceleration of SLM by PSO requires 10-20 times fewer iterations than the original SLM.

For the second idea, SLM’s training time can be significantly reduced through software implementation optimization. In the original SLM development stage, its implementation primarily relies on public Python packages such as scikit-learn [7], NumPy, and SciPy [8]. After its algorithmic details are settled, we follow the development pathway of XGBoost [9] and LightGBM [10], implement a C++ package, and integrate it with Cython. Both multithreaded CPU and CUDA/GPU execution are supported.

By combining the above two ideas, the accelerated SLM can achieve a speed-up factor of 577 in training as compared to the original SLM while maintaining comparable (or even better) classification/regression performance. The rest of this paper is organized as follows. The SLM method is reviewed in Sec. 2. The acceleration of SLM through PSO is introduced in Sec. 3. The optimization of software implementation is presented in Sec. 4. Experimental results are shown in Sec. 5. Finally, concluding remarks are given in Sec. 6.

2 Subspace Learning Machine (SLM)

2.1 Methodology

Inspired by the feedforward multilayer perceptron (FF-MLP) [5], decision tree (DT) [1] and extreme learning machine (ELM) [11], a new machine learning model named SLM was recently proposed by Fu et al. in [2]. Its main idea is subspace partitioning, where an input feature space is partitioned into multiple discriminant sets in a hierarchical manner. The whole input feature space is treated as a parent node, and it is partitioned by a hyperplane into two subsets, where each subset has a lower entropy value. Each subset is denoted by a child node. This partitioning process is conducted recursively until samples in leaf nodes are pure enough or the sample number is small enough to avoid overfitting. Majority vote is adopted at each leaf node, where all test samples in the node are predicted as the majority class. Different from traditional DT methods, SLM allows multiple splittings of a parent node at one level, which makes the SLM tree wider and shallower to prevent overfitting.

Instead of partitioning a node based on the original input feature space, SLM searches for discriminant projections for each parent node and performs its partitioning in a projected 1D space. The projected 1D space from the input feature vector \bm{x} can be expressed as

F_{\bm{a}}=\{f(\bm{a})\mid f(\bm{a})=\bm{a}^{T}\bm{x}\},   (1)

where

\bm{a}=(a_{1},\cdots,a_{d},\cdots,a_{D})^{T}, \quad ||\bm{a}||=1,   (2)

is the projection vector on the unit hypersphere in \mathbb{R}^{D} and D is the dimension of the input feature space. The 1D subspace can be partitioned into two disjoint sets based on a threshold, t_{b}, selected by the discriminant feature test (DFT) in [12]. They are written as

F_{\bm{a},t_{b},+} = \{f(\bm{a})\mid \bm{a}^{T}\bm{x}\geq t_{b}\},   (3)
F_{\bm{a},t_{b},-} = \{f(\bm{a})\mid \bm{a}^{T}\bm{x}<t_{b}\}.   (4)
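To make the projection and the threshold split concrete, the following minimal NumPy sketch (our own illustration, not the released SLM code) projects a feature matrix onto a candidate vector \bm{a}, scans candidate thresholds, and keeps the one with the lowest split loss. A weighted-entropy criterion is used here as a simple stand-in for the DFT loss of [12].

import numpy as np

def project(X, a):
    """Project samples X (N x D) onto the unit vector a, as in Eq. (1)."""
    a = a / np.linalg.norm(a)              # enforce ||a|| = 1, as in Eq. (2)
    return X @ a                           # f(a) = a^T x for every sample

def weighted_entropy(y_left, y_right):
    """Weighted entropy of a binary split; a simple stand-in for the DFT loss of [12]."""
    def entropy(y):
        if len(y) == 0:
            return 0.0
        p = np.bincount(y) / len(y)
        p = p[p > 0]
        return float(-np.sum(p * np.log2(p)))
    n = len(y_left) + len(y_right)
    return (len(y_left) * entropy(y_left) + len(y_right) * entropy(y_right)) / n

def split_1d(f, y, t_b):
    """Partition the projected 1D subspace by threshold t_b, as in Eqs. (3) and (4)."""
    positive = f >= t_b
    return y[positive], y[~positive]

# Toy usage: project onto a candidate vector and pick the lowest-loss threshold.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
a = np.array([1.0, 0.5, 0.0, 0.0, 0.0])
f = project(X, a)
candidates = np.quantile(f, np.linspace(0.05, 0.95, 19))
losses = [weighted_entropy(*split_1d(f, y, t)) for t in candidates]
t_best = candidates[int(np.argmin(losses))]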

We summarize the SLM tree construction as follows.

  1. A discriminant subspace S^{0} of dimension D_{0} is selected from the original feature space.

  2. p projection vectors are generated to project the feature space S^{0} to p 1D subspaces. The discriminability of each projected subspace is evaluated using the DFT loss in [12] so as to select the q most discriminant 1D subspaces. Correlation among them is considered to ensure diversity.

  3. The node is split based on each selected projected subspace, denoted as S^{1}. This completes the node splitting in one level of the SLM tree.

  4. Node splitting is conducted recursively until the stopping criteria are met.

2.2 Probabilistic Selection in SLM

The most important and challenging step in SLM is the selection of discriminant projection vectors that can separate samples into disjoint sets of lower entropy. Each projection vector is a linear combination of one-hot basis vectors, \bm{e}_{d}, d=1,\cdots,D, in the form of

\bm{a}=a_{1}\bm{e}_{1}+\cdots+a_{d}\bm{e}_{d}+\cdots+a_{D}\bm{e}_{D},   (5)

where a_{d} represents the coefficient of the corresponding basis vector \bm{e}_{d}.

A probabilistic search scheme was proposed in [2] to achieve this task. It first orders \bm{e}_{d} from the highest to the lowest discriminant power as evaluated by the DFT loss. Then, Eq. (5) can be rewritten as

\bm{a}=a^{\prime}_{1}\bm{e^{\prime}}_{1}+\cdots+a^{\prime}_{d}\bm{e^{\prime}}_{d}+\cdots+a^{\prime}_{D}\bm{e^{\prime}}_{D},   (6)

where the vectors \bm{e^{\prime}}_{d} are ordered from the most to the least discriminant. Different sets of coefficients \{a^{\prime}_{1},a^{\prime}_{2},\cdots,a^{\prime}_{D}\} then form different projection vectors. Although random selection can be performed to find optimal sets of coefficients, the search space is too large to be practical. Three hyper-parameters are used to guide the probabilistic search as described below.

  • P_{d}: the probability of selecting coefficient a^{\prime}_{d}.
    Since basis vectors are rank-ordered in decreasing discriminant power in Eq. (6), we let P_{d} decrease exponentially as d increases. In other words, the more discriminant a basis vector, the higher its likelihood of being selected. The resulting \bm{a} tends to be more discriminant as well.

  • A_{d}: the dynamic range of coefficient a^{\prime}_{d}.
    a^{\prime}_{d} is randomly selected among the integer values

    a^{\prime}_{d}=0,\,\pm 1,\,\pm 2,\,\cdots,\,\pm\lfloor A_{d}\rfloor,   (7)

    where

    A_{d}=\alpha_{0}\exp(-\alpha d).   (8)

    The more discriminant a basis vector, the larger its dynamic range in selection.

  • R: the number of selected coefficients.
    A subset of R basis vectors out of D is considered in the linear combination in Eq. (6). The remaining (D-R) coefficients are set to 0. This further reduces the search space for a large D value.

Given a set of projection vectors selected using the probabilistic search scheme stated above, the cosine similarity between each pair is measured. To ensure the diversity of the projected 1D spaces, projection vectors of higher correlation are pruned from the final selected set.
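As an illustration of this scheme, the sketch below samples candidate projection vectors with selection probabilities P_{d} that decay in d, draws integer coefficients within the dynamic range of Eqs. (7)-(8), and applies a greedy cosine-similarity pruning step. The decay rate beta, the values of alpha_0 and alpha, and the similarity threshold are illustrative choices of our own, not the settings used in [2].

import numpy as np

def sample_projection(D, R, alpha0=3.0, alpha=0.3, beta=0.3, rng=None):
    """Draw one candidate projection vector following the probabilistic search scheme.

    Basis vectors are assumed to be pre-ordered by decreasing DFT discriminability,
    so index d = 0 is the most discriminant dimension. alpha0, alpha and beta are
    illustrative values only.
    """
    rng = np.random.default_rng() if rng is None else rng
    a = np.zeros(D)
    P = np.exp(-beta * np.arange(D))                      # P_d decays exponentially in d
    P /= P.sum()
    chosen = rng.choice(D, size=R, replace=False, p=P)    # keep only R of the D coefficients
    for d in chosen:
        A_d = int(np.floor(alpha0 * np.exp(-alpha * d)))  # dynamic range A_d, Eq. (8)
        a[d] = rng.choice(np.arange(-A_d, A_d + 1))       # integer coefficient, Eq. (7)
    norm = np.linalg.norm(a)
    return a / norm if norm > 0 else a                    # normalize to the unit hypersphere

def prune_correlated(vectors, max_cos=0.9):
    """Greedy pruning: drop a candidate whose |cosine similarity| with a kept vector is too high."""
    kept = []
    for v in vectors:
        if all(abs(float(v @ u)) <= max_cos for u in kept):
            kept.append(v)
    return kept

rng = np.random.default_rng(0)
candidates = [sample_projection(D=10, R=4, rng=rng) for _ in range(50)]
selected = prune_correlated(candidates)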

3 Acceleration via Adaptive Particle Swarm Optimization (APSO)

As described in Sec. 2, probabilistic search is used in the original SLM to obtain good projection vectors. As the feature dimension increases, the search space grows exponentially and the probabilistic search scheme does not scale well. In this section, we adopt a metaheuristic algorithm, particle swarm optimization (PSO) [13], to address the same problem with higher efficiency.

3.1 Introduction to PSO

PSO is an evolutionary algorithm proposed by Kennedy and Eberhart in 1995 [6]. It finds the optimal solution in the search space by simulating the behavior of a flock using a swarm of particles. Each particle in the swarm has a position vector and a velocity vector. Mathematically, the i-th particle has the following position and velocity vectors:

{\bf x}_{i} = [x_{1},x_{2},\cdots,x_{D}]^{T},   (9)
{\bf v}_{i} = [v_{1},v_{2},\cdots,v_{D}]^{T},   (10)

where D is the dimension of the search space. The velocity and position vectors of the i-th particle are updated in each iteration via

{\bf v}_{i,t+1}=\omega{\bf v}_{i,t}+c_{1}r_{1}[{\bf pB}_{i,t}-{\bf x}_{i,t}]+c_{2}r_{2}[{\bf gB}-{\bf x}_{i,t}],   (11)
{\bf x}_{i,t+1}={\bf x}_{i,t}+{\bf v}_{i,t+1}.   (12)

Eq. (11) has the following parameters:

  • \omega: the inertia weight, which controls the influence of the particle's current velocity on the update.

  • {\bf gB}: the position vector of the best particle in the population (global best).

  • {\bf pB}_{i,t}: the best position visited by the i-th particle up to iteration t (personal best).

  • c_{1} and c_{2}: two learning factors.

  • r_{1} and r_{2}: two uniformly distributed random values between 0 and 1 used to simulate the stochastic behavior of particles.

The initial position and velocity vectors of all particles are randomly selected within a preset range. At each iteration, the velocity and position of each particle are updated according to Eqs. (11) and (12), respectively. Furthermore, we calculate the current loss of each particle and update {\bf gB} and {\bf pB} accordingly. Upon convergence, the position vector of the globally best particle is the output. The pseudo-code of PSO is provided in Algorithm 1.

Algorithm 1 Particle Swarm Optimization (PSO)

Input: Population, Dimension and MaxIteration
Output: Best particle’s position vector and loss

1:  Initialize all particles
2:  for iteration < MaxIteration do
3:      …
4:      for i < Population do
5:          Update V_i and X_i
6:          if loss_i < pbest_i then
7:              Update pbest_i
8:              if loss_i < gbest then
9:                  Update gbest
10:         else
11:             pass
12:         i = i + 1  // next particle
13:     iteration = iteration + 1  // next iteration
14: return gbest
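For concreteness, a minimal NumPy implementation of the update rules in Eqs. (11)-(12) and Algorithm 1 is sketched below. It is our own illustration with fixed \omega, c_{1} and c_{2} (the adaptive variant of Sec. 3.2 would adjust them per iteration), and the toy loss function stands in for the DFT loss used by SLM.

import numpy as np

def pso(loss_fn, dim, population=30, max_iter=100, w=0.7, c1=1.5, c2=1.5, rng=None):
    """Minimal PSO following Eqs. (11)-(12); returns the global best position and loss."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.uniform(-1.0, 1.0, size=(population, dim))    # particle positions
    v = rng.uniform(-0.1, 0.1, size=(population, dim))    # particle velocities
    pbest = x.copy()                                       # personal best positions
    pbest_loss = np.array([loss_fn(p) for p in x])
    g = int(np.argmin(pbest_loss))
    gbest, gbest_loss = pbest[g].copy(), pbest_loss[g]     # global best
    for _ in range(max_iter):
        r1 = rng.random((population, dim))
        r2 = rng.random((population, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)   # Eq. (11)
        x = x + v                                                    # Eq. (12)
        losses = np.array([loss_fn(p) for p in x])
        improved = losses < pbest_loss                               # update personal bests
        pbest[improved] = x[improved]
        pbest_loss[improved] = losses[improved]
        g = int(np.argmin(pbest_loss))
        if pbest_loss[g] < gbest_loss:                               # update global best
            gbest, gbest_loss = pbest[g].copy(), pbest_loss[g]
    return gbest, gbest_loss

# Toy usage: minimize a sphere function in 5 dimensions.
best_x, best_loss = pso(lambda p: float(np.sum(p ** 2)), dim=5)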

3.2 Adaptive PSO (APSO) in SLM

The PSO algorithm has several variants [14, 15, 16]. The search space of the projection vector in our problem is complex and contains quite a few local minima. A standard PSO can easily get stuck in a local minimum. One variant of PSO, called adaptive particle swarm optimization (APSO) [17], often yields better performance for multimodal problems, so it is adopted as an alternative choice in the SLM implementation.

Before an iteration, APSO calculates the distribution of particles in the space and identifies it as one of the following four states: Exploration, Exploitation, Convergence, and Jumping-out. APSO chooses parameters c_{1} and c_{2} dynamically in each state so as to make particles more effective under the current distribution. The rules are summarized below, followed by a small illustrative sketch.

  • In the exploration state, APSO increases the c_{1} value and decreases the c_{2} value. This allows particles to explore the whole search space more freely and individually to reach better individual optimal positions.

  • In the exploitation state, APSO increases the c_{1} value slightly and decreases the c_{2} value. As a result, particles can leverage local information more to find the global optimum.

  • In the convergence state, APSO increases both c_{1} and c_{2} slightly. An elite learning strategy is also introduced to promote the convergence of globally optimal particles, which makes the search more efficient.

  • In the jumping-out state, APSO decreases the c_{1} value and increases the c_{2} value. This allows particles to move away from their current cluster more easily. Thus, the particle swarm can jump out of the current local optimal solution.
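The exact state-estimation procedure and step sizes come from [17] and are not reproduced here; the sketch below only encodes the qualitative adjustment rules listed above, with a hypothetical step size delta and clamping range.

def adapt_learning_factors(state, c1, c2, delta=0.05):
    """Adjust c1/c2 per the qualitative APSO rules above; delta and bounds are illustrative."""
    if state == "exploration":            # explore freely: favor personal experience
        c1, c2 = c1 + delta, c2 - delta
    elif state == "exploitation":         # slight increase of c1, decrease of c2
        c1, c2 = c1 + 0.5 * delta, c2 - delta
    elif state == "convergence":          # slight increase of both
        c1, c2 = c1 + 0.5 * delta, c2 + 0.5 * delta
    elif state == "jumping-out":          # favor the global best to escape local optima
        c1, c2 = c1 - delta, c2 + delta
    # keep the learning factors within a typical range (illustrative bounds)
    return min(max(c1, 1.5), 2.5), min(max(c2, 1.5), 2.5)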

For classification tasks in SLM, the particle's position vector is the projection vector and the DFT loss is utilized as the loss function. In the original SLM, we split samples into several child nodes at each tree level. Since APSO can find the global optimum with the lowest loss, only one globally optimal projection is selected per node in the accelerated version. As a result, the resulting SLM tree is a binary tree. Furthermore, in the accelerated SLM algorithm, the top n discriminant features in each node are re-computed, and they serve as the input dimensions of the APSO algorithm. We will show in the experimental section that APSO demands fewer iterations to find the global optimum than probabilistic search. Thus, the overall search speed can be improved. The pseudo-code of the SLM tree building based on APSO is given in Algorithm 2.

Algorithm 2 SLM Tree Building with APSO
1: if Node meets the split condition then
2:     Run DFT for this node
3:     Select the top n channels from the DFT result
4:     Run adaptive PSO with n dimensions
5:     Use the gbest vector for partitioning
6:     Get two child nodes for the next building process
7: else
8:     Mark this node as a leaf node
9: return
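A recursive skeleton of Algorithm 2 is sketched below for illustration. The hooks find_projection and find_threshold are hypothetical stand-ins for the APSO search (e.g., the pso() sketch above minimizing the split loss) and for the DFT-based threshold selection of [12]; they are not the released SLM API.

import numpy as np

def build_slm_node(X, y, find_projection, find_threshold,
                   depth=0, max_depth=6, min_samples=20):
    """Recursive skeleton of Algorithm 2.

    find_projection(X, y) stands in for the APSO search over projection vectors,
    and find_threshold(f, y) for the DFT-based split-point selection; both are
    hypothetical hooks for illustration only.
    """
    # stopping criteria: pure node, too few samples, or maximum depth reached
    if depth >= max_depth or len(y) < min_samples or len(np.unique(y)) == 1:
        return {"leaf": True, "prediction": int(np.bincount(y).argmax())}   # majority vote
    a = find_projection(X, y)                      # gbest projection vector for this node
    f = X @ (a / np.linalg.norm(a))                # projected 1D subspace
    t_b = find_threshold(f, y)
    left, right = f < t_b, f >= t_b
    if not left.any() or not right.any():          # degenerate split: make a leaf instead
        return {"leaf": True, "prediction": int(np.bincount(y).argmax())}
    child = lambda m: build_slm_node(X[m], y[m], find_projection, find_threshold,
                                     depth + 1, max_depth, min_samples)
    return {"leaf": False, "a": a, "t_b": t_b, "left": child(left), "right": child(right)}

Each call splits one node into exactly two children, so the accelerated SLM tree is binary, as noted above.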

4 Acceleration via Parallel Processing

The SLM algorithm in [2] was implemented in pure Python. While Python code is easy to write, its running speed is slow. Cython is a superset of Python that bridges the gap between Python and C. It allows static typing and compilation from Cython to C. Furthermore, there are a few commonly used math libraries in Python such as SciPy and scikit-learn.

The first step towards run-time acceleration is to convert the computational bottleneck of the SLM implementation from Python to Cython. The bottleneck is the search for the optimal projection vector. For a projection vector at a node, we calculate the loss at each split point so as to find the minimum loss in the 1D space. Since these computations do not depend on each other, they can be computed in parallel in principle. However, Python threads cannot execute CPU-bound code in parallel due to the Global Interpreter Lock (GIL), and multi-processing was deemed unsuitable due to its higher overhead in the current context.

Instead, we implement a multithreaded C++ extension for the optimal projection vector search and integrate it with the remaining Cython code. In the C++ extension, we spawn multiple threads, where each thread covers the computation of a fixed number of splits per selected projection vector. The number of launched threads is capped at the number of physical cores to prevent excessive context switching. A task is defined as calculating the loss on a single split of a projected 1D subspace. These tasks are evenly allocated to the threads, and each thread processes its tasks sequentially.
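The sketch below illustrates this task granularity: each "task" evaluates the split loss at one candidate threshold of a projected 1D subspace, and tasks are divided evenly among a fixed number of workers. In the actual implementation the chunks are executed concurrently by C++ threads (or one task per CUDA thread); here they are processed sequentially in Python only to show the decomposition, and the weighted-entropy loss is a stand-in for the DFT loss.

import numpy as np

def split_loss(f, y, t):
    """One task: weighted-entropy loss of splitting the projected 1D subspace f at threshold t."""
    loss = 0.0
    for part in (y[f >= t], y[f < t]):
        if len(part):
            p = np.bincount(part) / len(part)
            p = p[p > 0]
            loss += (len(part) / len(y)) * float(-np.sum(p * np.log2(p)))
    return loss

def chunk(tasks, n_workers):
    """Evenly allocate tasks to workers, mirroring the even split-per-thread allocation above."""
    return [tasks[k::n_workers] for k in range(n_workers)]

# Toy usage: find the minimum-loss split for one projection vector.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((500, 8)), rng.integers(0, 2, 500)
a = rng.standard_normal(8)
a /= np.linalg.norm(a)
f = X @ a
thresholds = np.quantile(f, np.linspace(0.02, 0.98, 49))   # candidate split points
chunks = chunk(list(thresholds), n_workers=4)               # in C++: one chunk per thread
results = [(t, split_loss(f, y, t)) for c in chunks for t in c]
t_best, loss_best = min(results, key=lambda r: r[1])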

Compared to CPUs, modern GPUs have thousands of cores and context switching occurs significantly less frequently. Thus, for GPU processing, each thread is responsible for a single task in our design. Once the task is completed, the GPU core moves on to the next available task, if any. Once the computation is completed, the loss values are returned to the main process, where the minimum-loss projection and split are identified.

Table 1: Comparison of training run-time (in seconds) of 6 settings for 9 classification datasets.
Method  Setting  circle-and-ring  2-new-moons  4-new-moons  Iris  Wine  B.C.W  Pima  Ionosphere  Banknote
SLM A 7.026 4.015 6.019 82.939 65.652 234.597 339.077 248.410 118.000
B 0.287 0.188 0.200 0.465 0.808 2.931 3.920 5.217 0.697
C 0.346 0.0943 0.236 0.295 0.700 2.657 3.263 3.838 0.439
D 4.071 3.159 3.245 3.205 3.572 3.956 4.660 5.021 4.370
E 7.597 3.146 7.200 4.188 3.106 11.605 13.892 8.123 8.588
F 0.270 0.0887 0.278 0.192 0.831 2.837 2.738 1.371 0.619
SLM Forest A 160.264 107.695 157.307 2333.514 2605.288 5067.356 10024.238 6582.104 3020.587
B 8.507 5.272 8.471 17.142 26.254 134.301 101.631 152.625 40.365
C 7.910 4.714 8.170 13.068 22.176 116.843 88.972 137.390 16.341
D 7.708 4.954 7.678 5.946 15.066 98.969 48.987 80.547 16.511
E 220.021 96.712 219.002 105.768 94.788 360.089 396.989 247.069 259.000
F 8.75 4.058 9.02 5.822 20.081 98.490 31.684 75.964 14.054
SLM Boost A 143.332 116.156 151.904 2347.904 2591.919 7118.338 9887.949 6395.744 3412.654
B 7.459 5.185 7.533 16.087 21.873 117.01 104.47 148.586 23.443
C 7.987 3.434 8.293 12.144 15.624 115.226 80.273 131.712 14.207
D 7.741 4.894 8.182 6.093 15.573 97.599 46.057 76.589 15.588
E 233.071 97.541 233.347 107.709 95.534 362.295 396.989 247.069 262.546
F 6.837 3.581 8.425 5.915 20.159 94.413 40.413 79.012 15.516
Table 2: Comparison of training run-time (in seconds) of 6 settings for 6 regression datasets.
Method  Setting  Make Friedman1  Make Friedman2  Make Friedman3  Boston  California housing  Diabetes
SLR A 152.291 60.232 186.535 153.173 756.868 147.933
B 2.136 0.653 2.137 1.255 26.417 1.136
C 0.937 0.331 2.100 0.780 5.625 0.608
D 1.577 0.448 3.832 1.31 5.508 0.722
E 29.742 28.294 44.511 30.608 96.319 27.162
F 0.639 0.332 0.755 0.675 5.891 0.322
SLR Forest A 5406.271 3346.595 7459.372 6454.979 21092.217 4413.089
B 62.527 19.244 57.117 55.204 1049.859 46.622
C 35.576 13.533 63.551 30.943 254.415 29.606
D 19.408 6.48 42.182 18.04 125.104 17.242
E 1025.801 940.359 1051.498 792.33 3190.336 775.153
F 17.993 13.718 19.898 11.178 193.741 10.879
SLR Boost A 4678.86 1955.517 6770.808 4097.66 21612.672 4200.828
B 62.156 19.724 59.435 38.154 817.688 34.767
C 26.704 10.546 70.716 29.149 159.434 20.528
D 19.293 6.551 51.222 19.817 124.702 38.897
E 919.874 920.619 1834.926 910.271 3336.018 901.092
F 14.170 13.114 24.759 12.186 195.459 10.594

5 Experiments

Datasets. We follow the experimental setup in [2] and conduct performance benchmarking on 9 classification datasets (i.e., Circle-and-ring, 2-new-moons, 4-new-moons, Iris, Wine, B.C.W, Pima, Ionosphere, and Banknote) and 6 regression datasets (i.e., Make Friedman1, Make Friedman2, Make Friedman3, Boston, California housing and Diabetes). For details of these 15 datasets, we refer readers to [2].

Benchmarking Algorithms and Acceleration Settings. Ensemble methods are commonly used in the machine learning field to boost performance. An ensemble model aims to obtain better performance than each constituent model alone. Inspired by the random forest (RF) [18] and XGBoost for DT, SLM Forest and SLM Boost are the bagging and boosting methods for SLM, respectively. The counterpart of SLM for the regression task is called the subspace learning regressor (SLR). Again, SLR has two ensemble methods, namely, SLR Forest and SLR Boost. They are detailed in [2].

We compare the run-time of six settings for the three methods SLM, SLM Forest and SLM Boost in Table 1 and of the same six settings for SLR, SLR Forest and SLR Boost in Table 2. The settings are:

  • A: Probabilistic search in pure Python,

  • B: Probabilistic search in Cython and single-threaded C++,

  • C: Probabilistic search in Cython and multithreaded C++,

  • D: Probabilistic search in CUDA/GPU,

  • E: APSO in pure Python,

  • F: APSO in Cython and multithreaded C++.

For a given dataset, settings [A]-[D] provide the same classification (or regression) performance in terms of classification accuracy (or regression MSE), while settings [E] and [F] offer the same classification accuracy (or regression MSE). Their main difference lies in the model training time. All settings are run on an Intel(R) Xeon(R) CPU E5-2620 v3 @2.40GHz with 12 cores / 24 threads and an Nvidia Quadro M6000 GPU. All hyper-parameters are the same. The number of trees for the two SLM/SLR ensemble methods (i.e., SLM/SLR Forest and SLM/SLR Boost) is set to 30.

Acceleration by Parallel Processing. We compare the run-time of SLM, SLM Forest and SLM Boost for the 9 classification datasets in Table 1 and that of SLR, SLR Forest and SLR Boost for the 6 regression datasets in Table 2. The shortest run-time is highlighted in bold face. We see from the tables that simply changing the implementation from Python to C++ yields a speed-up factor of x40 to x100 across all datasets. After that, the speed-up is incremental. Multithreading should lead to increased performance in theory. Yet, in practice, it is greatly influenced by the state of the machine, e.g., caching, task scheduling, and whether other cores are occupied by other processes. Moreover, additional overhead occurs in spawning, managing, and destroying threads. Typically, longer computations per thread are more suitable for multithreading. Thus, the performance improvement on smaller datasets, where computations take less time, is less pronounced. Similarly, for single SLM tree training, the model training time on GPU is notably worse due to the overhead of moving data to GPU memory. As the training scales (e.g., with ensembles of SLM/SLR trees such as SLM/SLR Forest and SLM/SLR Boost), CUDA/GPU generally outperforms multithreading.

Table 3: Comparison of classification accuracy (in %) of SLM, SLM Forest and SLM Boost with probabilistic search and with APSO acceleration on 9 classification datasets.
Method [Settings]  circle-and-ring  2-new-moons  4-new-moons  Iris  Wine  B.C.W  Pima  Ionosphere  Banknote  Number of Iterations
SLM [A,B,C,D] 96.50 100 99.75 98.33 98.61 97.23 77.71 90.07 99.09 1000
SLM [E,F] 96.50 100 99.75 98.33 100 98.68 77.71 92.19 99.64 110
SLM Forest [A,B,C,D] 100 100 100 98.33 100 97.36 79.00 95.71 99.81 1000
SLM Forest [E,F] 100 100 100 98.33 98.61 97.80 77.71 95.03 100 110
SLM Boost [A,B,C,D] 100 100 100 98.33 100 98.83 77.71 94.33 100 1000
SLM Boost [E,F] 100 100 100 98.33 100 97.80 78.34 94.33 100 110
Table 4: Comparison of regression mean-squared errors (MSE) of SLR, SLR Forest and SLR Boost with probabilistic search and with APSO acceleration on 6 regression datasets.
Method [Settings]  Make Friedman1  Make Friedman2  Make Friedman3  Boston  California housing  Diabetes  Number of Iterations
SLR [A,B,C,D] 20.46 125830 0.078 65.68 1.172 5238.8 2000
SLR [E,F] 13.52 12001 0.022 27.88 0.575 4018.7 110
SLR Forest [A,B,C,D] 5.46 1641 0.007 13.18 0.407 2624.4 2000
SLR Forest [E,F] 4.68 2144 0.012 13.48 0.419 2640.7 110
SLR Boost [A,B,C,D] 3.85 670 0.008 13.07 0.365 2710.0 2000
SLR Boost [E,F] 3.95 662 0.005 12.91 0.341 2711.2 110

Acceleration by APSO. To demonstrate the effectiveness of APSO, we compare the classification and regression performance of probabilistic search and APSO in Tables 3 and 4, respectively. Their performance in terms of classification accuracy and regression MSE is comparable. As shown in the last column of both tables, the iteration number of APSO is about 10% (for SLM) and 5% (for SLR) of that of probabilistic search.

Joint Acceleration by Parallel Processing and APSO. The effect of joint acceleration by parallel processing and APSO can be seen from setting F in Tables 1 and 2. We observe that the single-threaded C++ implementation of APSO has a speed comparable to that of the GPU version of probabilistic search. We have not yet implemented a GPU version of APSO acceleration but expect it to shorten the training run-time by another level.

In summary, APSO acceleration reduces the time complexity of SLM/SLR by cutting the number of iterations to 5-10% of the original. Its C++ implementation and parallel processing reduce the training time of each iteration by a factor of x40 to x100. With the combination of both, APSO SLM/SLR with the C++ implementation gives the best overall performance in training time, classification accuracy and regression error. As compared to the original SLM/SLR, its training time can be accelerated by a factor of up to 577.

6 Conclusion and Future Work

Two ways were proposed to accelerate SLM in this work. First, we applied the particle swarm optimization (PSO) algorithm to speed up the search for a discriminant dimension, which was accomplished by probabilistic search in the original SLM. The acceleration of SLM by PSO requires 10-20 times fewer iterations. Second, we leveraged parallel processing in the implementation. Experimental results showed that, as compared to the original SLM/SLR, the accelerated SLM/SLR can achieve a speed-up factor of 577 in training time while maintaining comparable classification and regression performance.

The datasets considered in [2] and this work are of lower dimensions. Recently, unsupervised representation learning for images and videos has been intensively studied, e.g., [19, 20, 21, 22, 23, 24, 25]. The learned representations and their labels can be fed into a classifier/regressor for supervised classification and regression. This leads to the traditional pattern recognition paradigm, where feature extraction and classification modules are cascaded for various tasks. The feature dimensions for these problems are on the order of hundreds. It is important to develop SLM for high-dimensional image and video data to explore its full potential.

References

  • [1] L. Breiman, J. Friedman, C. Stone, and R. Olshen, Classification and Regression Trees. CRC Press, Boca Raton, FL, 1984.
  • [2] H. Fu, Y. Yang, V. K. Mishra, and C.-C. J. Kuo, “Subspace learning machine (slm): Methodology and performance,” arXiv preprint arXiv:2205.05296, 2022.
  • [3] C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20, no. 3, pp. 273–297, 1995.
  • [4] F. Rosenblatt, “The perceptron: a probabilistic model for information storage and organization in the brain.,” Psychological review, vol. 65, no. 6, p. 386, 1958.
  • [5] R. Lin, Z. Zhou, S. You, R. Rao, and C.-C. J. Kuo, “From two-class linear discriminant analysis to interpretable multilayer perceptron design,” arXiv preprint arXiv:2009.04442, 2020.
  • [6] J. Kennedy and R. Eberhart, “Particle swarm optimization,” in Proceedings of ICNN’95-international conference on neural networks, vol. 4, pp. 1942–1948, IEEE, 1995.
  • [7] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
  • [8] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, et al., “SciPy 1.0: Fundamental algorithms for scientific computing in Python,” Nature Methods, vol. 17, no. 3, pp. 261–272, 2020.
  • [9] T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794, 2016.
  • [10] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu, “LightGBM: A highly efficient gradient boosting decision tree,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [11] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme learning machine: theory and applications,” Neurocomputing, vol. 70, no. 1-3, pp. 489–501, 2006.
  • [12] Y. Yang, W. Wang, H. Fu, and C.-C. J. Kuo, “On supervised feature selection from high dimensional feature spaces,” arXiv preprint arXiv:2203.11924, 2022.
  • [13] R. C. Eberhart, Y. Shi, and J. Kennedy, Swarm intelligence. Elsevier, 2001.
  • [14] R. A. Krohling and L. dos Santos Coelho, “Coevolutionary particle swarm optimization using gaussian distribution for solving constrained optimization problems,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 36, no. 6, pp. 1407–1416, 2006.
  • [15] J. J. Liang, A. K. Qin, P. N. Suganthan, and S. Baskar, “Comprehensive learning particle swarm optimizer for global optimization of multimodal functions,” IEEE transactions on evolutionary computation, vol. 10, no. 3, pp. 281–295, 2006.
  • [16] Y. Shi and R. C. Eberhart, “Fuzzy adaptive particle swarm optimization,” in Proceedings of the 2001 congress on evolutionary computation (IEEE Cat. No. 01TH8546), vol. 1, pp. 101–106, IEEE, 2001.
  • [17] Z.-H. Zhan, J. Zhang, Y. Li, and H. S.-H. Chung, “Adaptive particle swarm optimization,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 39, no. 6, pp. 1362–1381, 2009.
  • [18] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001.
  • [19] Y. Chen and C.-C. J. Kuo, “PixelHop: A successive subspace learning (SSL) method for object recognition,” Journal of Visual Communication and Image Representation, vol. 70, p. 102749, 2020.
  • [20] Y. Yang, H. Fu, and C.-C. J. Kuo, “Design of supervision-scalable learning systems: Methodology and performance benchmarking,” arXiv preprint arXiv:2206.09061, 2022.
  • [21] C.-C. J. Kuo, M. Zhang, S. Li, J. Duan, and Y. Chen, “Interpretable convolutional neural networks via feedforward design,” Journal of Visual Communication and Image Representation, 2019.
  • [22] Z. Zhou, H. Fu, S. You, C. C. Borel-Donohue, and C.-C. J. Kuo, “UHP-SOT: An unsupervised high-performance single object tracker,” in 2021 International Conference on Visual Communications and Image Processing (VCIP), pp. 1–5, IEEE, 2021.
  • [23] Z. Zhou, H. Fu, S. You, and C.-C. J. Kuo, “Unsupervised lightweight single object tracking with UHP-SOT++,” arXiv preprint arXiv:2111.07548, 2021.
  • [24] Z. Zhou, H. Fu, S. You, and C.-C. J. Kuo, “GUSOT: Green and unsupervised single object tracking for long video sequences,” arXiv preprint arXiv:2207.07629, 2022.
  • [25] Y. Zhu, X. Wang, H.-S. Chen, R. Salloum, and C.-C. J. Kuo, “A-PixelHop: A green, robust and explainable fake-image detector,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8947–8951, IEEE, 2022.