
cuPSO: GPU Parallelization for Particle Swarm Optimization Algorithms

Chuan-Chi Wang, National Taiwan University, Taipei, Taiwan, [email protected]; Chun-Yen Ho, National Taiwan University, Taipei, Taiwan, [email protected]; Chia-Heng Tu (0000-0001-8967-1385), National Cheng Kung University, Tainan, Taiwan, [email protected]; and Shih-Hao Hung (0000-0003-2043-2663), National Taiwan University, Taipei, Taiwan, [email protected]
Abstract.

Particle Swarm Optimization (PSO) is a stochastic technique for solving optimization problems. Attempts have been made to shorten the computation times of PSO-based algorithms with massive threads on GPUs (graphics processing units), where thread groups are formed to calculate the information of particles, and the computed outputs for the particles are aggregated and analyzed to find the best solution. In particular, the reduction-based method is a common approach to handle the data aggregation and analysis for the calculated particle information. Nevertheless, based on our analysis, the reduction-based method suffers from excessive memory accesses and thread synchronization overheads. In this paper, we propose a novel algorithm that alleviates the above overheads with atomic functions. The threads within a thread group conditionally update their calculated results to an intra-group data queue atomically, which avoids the frequent memory accesses performed by the parallel reduction operations. Furthermore, we develop an enhanced version of the algorithm that alleviates the synchronization barrier among the thread groups, achieved by allowing the thread groups to run asynchronously and updating the global, lock-protected variables only when necessary. Our experimental results show that our proposed algorithm running on an Nvidia GPU is about 200 times faster than the serial version executed by an Intel Xeon CPU. Moreover, the novel algorithm outperforms the state-of-the-art method (the parallel reduction approach) by a factor of 2.2.

Copyright: rights retained. DOI: 10.1145/3477314.3507142. ISBN: 978-1-4503-8025-6/22/02. Article: 4. Price: $15.00.

1. Introduction

Particle Swarm Optimization (PSO), introduced in 1995 by Kennedy and Eberhart (Kennedy and Eberhart, 1995), is a powerful optimization algorithm based on a stochastic optimization technique. A PSO algorithm searches for the optimum of the target function, called the fitness function, by simulating social behaviors with particles, e.g., the movement of a bird flock or fish school looking for food. Thanks to its versatility in solving various practical problems (PSO, 2001; Poli, 2008), PSO has evolved into sophisticated variants that improve the optimization performance (Zhang and Xie, 2003; Krink and Løvbjerg, 2002; Niknam and Amiri, 2010) and accelerate the execution efficiency on different types of hardware platforms, including CPUs (Wang et al., 2008), Field Programmable Gate Arrays (FPGAs) (Venayagamoorthy and Gudise, 2004; Yamada et al., 2010), and GPUs (Veronese and Krohling, 2009).

As a population-based metaheuristic algorithm, PSO is usually implemented by performing the stochastic search process iteratively, which demands a large number of particle status updates and target function evaluations when handling a sophisticated problem. This computation-hungry nature poses a challenge for real-time applications in dynamic environments. For example, PSO can be used to track moving objects and to determine human postures in computer vision applications. Hence, the capability of fast convergence is critical for PSO to meet real-time requirements.

To boost the execution performance of PSO algorithms, GPUs have been adopted to solve PSO problems with a large number of parallel threads. In particular, particles are partitioned into groups to be handled by the underlying thread groups for the status updates and fitness evaluations, and the data calculated by different thread groups are aggregated and analyzed to obtain the best solution, which becomes the optimization target for the next iteration. The iterative process stops when the target precision or the specified iteration count is reached. To improve execution efficiency, parallel reductions are usually adopted (Harris, [n. d.]) to expedite the data aggregation and analysis. Nevertheless, the reduction-based methods require a considerable number of memory accesses to keep the intermediate results, plus a synchronization operation to wait for the thread groups to write their intermediate results to memory before the best data of the iteration is obtained. These two overheads hinder the potential performance of solving PSO problems.

In this paper, we propose a novel algorithm to alleviate the excessive memory access overhead. It is referred to as the queue algorithm, since an atomic operation is used to protect the conditional accesses to the memory locations when parallel threads within a group update their results, where the behavior of the sequential accesses is similar to that of the queue data structure. In addition, an enhanced algorithm is developed to further improve the execution efficiency by removing the synchronization barrier when the thread groups update their results, which is done by introducing an atomic lock to control the accesses when updating the best data of the current iteration. The enhanced algorithm, referred to as the queue lock algorithm, performs especially well for PSO problems of lower complexity, i.e., when the problem domain is formulated as a one-dimensional space. The proposed algorithms have been implemented and tested with two problem configurations of low and high complexity (i.e., 1 dimension and 120 dimensions) on a machine with an Intel Xeon processor and an Nvidia GTX 1080 Ti GPU. Our results show over 200x speedups by using the GPU, compared with the serial version running on the CPU. Besides, the proposed algorithm is faster than the state-of-the-art method (the reduction-based approach (CHEN Feng, 2014)) by a factor of 2.2 in terms of the computation time of the PSO algorithm. The source code of this work is available online at https://github.com/wang2346581/cuPSO, and we believe that our efforts move the real-time processing of PSO problems a step forward. The contributions of this work are summarized as follows.

  (1) Propose the queue algorithm to reduce the concurrent memory access delays.

  (2) Propose the queue lock algorithm to further remove the synchronization overhead.

  (3) Implement the above algorithms and compare their results against those achieved by the state-of-the-art methods, i.e., the parallel reduction and loop-unrolling methods. The experimental results demonstrate the effectiveness and efficiency of our proposed algorithms.

The rest of this paper is organized as follows. Section 2 provides the background of the PSO algorithms for serial and parallel executions. Section 3 presents the design of our proposed algorithms. Section 4 details the implementation remarks. Section 5 shows the experimental results. Section 6 concludes this paper and gives the future work.

2. Particle Swarm Optimization (PSO) Algorithms

2.1. Standard PSO Algorithm (SPSO)

The parameters of the Standard Particle Swarm Optimization (SPSO) algorithm are defined in Table 1. In general, the parameters w, c1, and c2 are tuned for different practical situations, controlling the behavior and efficiency of the SPSO algorithm. max_pos, min_pos, max_v, and min_v restrict the lower and upper boundaries of each dimension in the search space for the position and velocity, respectively. max_iter is the termination criterion on the number of iterations, and particle_cnt is the total number of particles. Moreover, pos_i, v_i, fit_i, pbest_pos_i, and pbest_fit_i are the necessary information of each particle i, representing its position, velocity, fitness, best-known position, and best-known fitness, respectively, where the subscript i is the index of the particle. Finally, gbest_pos is the best-fitness position ever found by the whole swarm, and gbest_fit is the fitness value obtained at gbest_pos.

Table 1. Description of PSO required parameters.
Name Description
w the inertia weight
c1 the cognitive coefficient
c2 the social coefficient
max_pos the maximum position of the particles
min_pos the minimum position of the particles
max_v the maximum velocity of the particles
min_v the minimum velocity of the particles
max_iter the maximum number of iterations
particle_cnt the total number of particles
pos_i the position of particle i
v_i the velocity of particle i
fit_i the fitness of particle i
pbest_pos_i the best-known position of particle i
pbest_fit_i the best-known fitness of particle i
gbest_pos the best-known position of the whole swarm
gbest_fit the best-known fitness of the whole swarm

Algorithm 1 shows the full workflow of the SPSO algorithm, which can be decomposed into 5 steps; the algorithm repeats Steps 2 to 5 until the termination condition is satisfied (a plain C sketch of the per-particle update is given after Algorithm 1). (1) Initialize the information of each particle, including position, velocity, fitness, the best-known position and fitness of each particle, and the global best-known position and fitness; the position and velocity are initialized with uniformly distributed random vectors, constrained to fixed ranges to ensure convergence. (2) Update the velocity of each particle by Equation 1 and the position of each particle by Equation 2 within the domain of the fitness function at time t to explore the whole search space, where r1 and r2 are random values uniformly distributed in [0, 1]. (3) Calculate the fitness of each particle from its position and the fitness function. (4) Update the best-known position and fitness of the particle (local best for short) when the new fitness is better than the previous one. (5) Update the global best-known position and fitness (global best for short) when the fitness of the particle is better than the global best-known fitness.

(1) v_i(t+1) = w * v_i(t) + c1 * r1 * (pbest_pos_i(t) - pos_i(t)) + c2 * r2 * (gbest_pos(t) - pos_i(t))
(2) pos_i(t+1) = pos_i(t) + v_i(t+1)
Algorithm 1 Sequential SPSO algorithm
1: for i in particle_cnt do ▷ Step 1: Initialization
2:     Initialize pos_i and v_i randomly.
3:     Evaluate fit_i by pos_i and the fitness function.
4:     Initialize pbest_fit_i and pbest_pos_i.
5:     Update gbest_pos and gbest_fit.
6: end for
7: for j in max_iter do ▷ Steps 2-5: Computation
8:     for i in particle_cnt do
9:         Update v_i by pos_i, pbest_pos_i, and gbest_pos. ▷ Step 2
10:        Keep v_i in the range (min_v, max_v).
11:        Update pos_i by pos_i and v_i.
12:        Keep pos_i in the range (min_pos, max_pos).
13:        Evaluate fit_i by pos_i. ▷ Step 3
14:        if fit_i > pbest_fit_i then ▷ Update local best
15:            Update pbest_fit_i and pbest_pos_i by fit_i and pos_i.
16:        end if ▷ Step 4
17:        if pbest_fit_i > gbest_fit then ▷ Update global best
18:            Update gbest_pos and gbest_fit by pbest_pos_i and pbest_fit_i.
19:        end if ▷ Step 5
20:    end for
21: end for
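
For concreteness, Steps 2 and 3 of Algorithm 1 can be written in plain C as the sketch below, for the one-dimensional case. The array and parameter names mirror Table 1; clamp() is our own helper, and the Cubic fitness of Equation 3 is used as a stand-in target function.

#include <stdlib.h>

#define PARTICLE_CNT 256

/* Per-particle state, named after Table 1 (one dimension). */
double pos[PARTICLE_CNT], v[PARTICLE_CNT], fit[PARTICLE_CNT];
double pbest_pos[PARTICLE_CNT];
double gbest_pos;
const double w = 1.0, c1 = 2.0, c2 = 2.0;
const double min_v = -1.0, max_v = 1.0;
const double min_pos = -100.0, max_pos = 100.0;

/* Stand-in fitness: the Cubic function of Equation 3 with d = 1. */
double fitness(double x) {
    return x * x * x - 0.8 * x * x - 1000.0 * x + 8000.0;
}

static double clamp(double x, double lo, double hi) {
    return x < lo ? lo : (x > hi ? hi : x);
}

/* Steps 2-3 of Algorithm 1 for particle i. */
void update_particle(int i) {
    double r1 = (double)rand() / RAND_MAX;  /* uniform in [0, 1] */
    double r2 = (double)rand() / RAND_MAX;

    v[i] = w * v[i]
         + c1 * r1 * (pbest_pos[i] - pos[i])
         + c2 * r2 * (gbest_pos - pos[i]);            /* Equation 1 */
    v[i] = clamp(v[i], min_v, max_v);                 /* line 10 */

    pos[i] = clamp(pos[i] + v[i], min_pos, max_pos);  /* Equation 2, line 12 */
    fit[i] = fitness(pos[i]);                         /* Step 3 */
}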

2.2. Parallel PSO Algorithm (PPSO)

In order to take advantage of the computing power of GPUs (graphics processing units), a parallel PSO (PPSO) has been developed based on the SPSO introduced in Section 2.1 to accelerate Steps 2-5 of Algorithm 1. The PPSO program is executed by the CPU and the GPU cooperatively: the particle information initialization (Step 1) and the finalization (obtaining the final results) are done by the CPU, whereas the compute-intensive part (Steps 2-5) is performed on the GPU.

In PPSO (CHEN Feng, 2014), a thread is used to handle the computations for calculating the information of a particle, e.g., velocity, position, and fitness, as defined in Steps 2 and 3; the workflow is illustrated as the "1st kernel" (the first parallel code section) in Figure 1. The threads are grouped into blocks to update the best-known position and fitness for a group of particles, where the best data within each block is computed at the end of the "1st kernel" in Figure 1. Note that, to speed up the process of obtaining the best data within each block, parallel reduction operations are performed within each block (i.e., each thread group). Subsequently, the second parallel code section (the "2nd kernel") further derives the best position and fitness across the thread groups (blocks), which is also performed by parallel reductions. In other words, the first kernel is executed by multiple thread blocks, and each block performs the reductions internally to find its local best data point, whereas the second kernel is handled by a single thread block that searches for the global best data point among the outputs of the first kernel's thread blocks.

The parallel reduction incurs extensive arithmetic operations for calculating the addresses of the reduction operands, as well as branching instructions in the reduction loops. A common approach is to unroll the reduction loops to reduce these overheads, since the addresses are then computed offline and the sequentially unrolled code avoids the branches. Nevertheless, unrolling the loops cannot remove the implicit synchronization between the first and second kernels, and during the iterative search process the two kernels are invoked repeatedly, which introduces a significant amount of overhead. In this work, we develop a novel algorithm to tackle this problem.
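
To make the baseline concrete, the intra-block part of the "1st kernel" is typically a tree reduction like the sketch below; the kernel and array names (reduce_block_best, sFit, auxFit) are our own illustration, not the exact code of (CHEN Feng, 2014).

#include <cfloat>
#define BLOCK_SIZE 256

__global__ void reduce_block_best(const double *fit, double *auxFit,
                                  int *auxIdx, int particle_cnt) {
    __shared__ double sFit[BLOCK_SIZE];
    __shared__ int    sIdx[BLOCK_SIZE];

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    sFit[tid] = (i < particle_cnt) ? fit[i] : -DBL_MAX;
    sIdx[tid] = i;
    __syncthreads();

    // Tree reduction: log2(blockDim.x) rounds, each ending with a barrier.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s && sFit[tid + s] > sFit[tid]) {
            sFit[tid] = sFit[tid + s];
            sIdx[tid] = sIdx[tid + s];
        }
        __syncthreads();
    }

    // One block-local best per block; a second kernel reduces these.
    if (tid == 0) {
        auxFit[blockIdx.x] = sFit[0];
        auxIdx[blockIdx.x] = sIdx[0];
    }
}

Each round ends with a __syncthreads() barrier, and the per-block results in auxFit must be written back to global memory before the second kernel can start, which is exactly the memory traffic and synchronization cost discussed above.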

Figure 1. The workflow of parallel PSO algorithm.

3. Proposed Parallel Algorithms

3.1. PPSO with the Shared Memory Queue

In practical tests, we find that the conditions of Steps 4 and 5 in Algorithm 1 are seldom satisfied; the probability may be less than 0.1%. Taking Step 4 as an example, the fit_i calculated by the i-th thread is usually less than the pbest_fit_i tested in line 14 of the algorithm. Based on this observation, we develop a shared memory based solution, referred to as the queue algorithm, which lets each thread update the queue (in-order accessed arrays whose index is guarded by an atomic addition) only if its calculated fit value is larger than the global value gbest_fit. The core idea is the sequential access (enforced by the atomic add operation) to the shared memory arrays, as listed in lines 1-4 of Algorithm 2. To further identify the best configuration of each block, the thread whose identifier is zero (i.e., threadIdx.x == 0) scans the entire queue, finds the local best configuration, and updates the local best information to the auxiliary arrays, as listed in lines 7-20 of Algorithm 2 (a CUDA sketch of this kernel is given after Algorithm 2). In particular, the local best information of a block (stored in bestFitQueue[0]) is written to the auxiliary arrays indexed by the block identifier (auxFit[blockIdx.x] and auxPos[blockIdx.x]), as specified in lines 17-18 of Algorithm 2. Regarding the complexity of updating the intra-block best, we reduce it from the O(log n) of the reduction operation to a theoretical O(1) with the queue method; since the queue is rarely non-empty and is scanned only when necessary, the resulting complexity is close to O(1) most of the time. Note that a similar approach is adopted when the second kernel updates the global best information.

Algorithm 2 The queue algorithm.
1: if fit > gbest_fit then
2:     unsigned const qIdx ← atomicAdd(&num, 1);
3:     bestFitQueue[qIdx] ← fit;
4:     bestPosQueue[qIdx] ← pos;
5: end if
6: __syncthreads();
7: if idx < particle_cnt && threadIdx.x == 0 then
8:     auxFit[blockIdx.x] ← INT_MIN;
9:     auxPos[blockIdx.x] ← INT_MIN;
10:    if num then
11:        for j in num do
12:            if bestFitQueue[j] > bestFitQueue[0] then
13:                bestFitQueue[0] ← bestFitQueue[j]
14:                bestPosQueue[0] ← bestPosQueue[j]
15:            end if
16:        end for
17:        auxFit[blockIdx.x] ← bestFitQueue[0];
18:        auxPos[blockIdx.x] ← bestPosQueue[0];
19:    end if
20: end if
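
A possible CUDA rendering of Algorithm 2 for the one-dimensional case is sketched below. The kernel signature, BLOCK_SIZE, and the use of -DBL_MAX as the sentinel (in place of INT_MIN) are our assumptions rather than the exact cuPSO source.

#include <cfloat>
#define BLOCK_SIZE 256

__global__ void queue_update_block_best(const double *fit, const double *pos,
                                        const double *gbest_fit,
                                        double *auxFit, double *auxPos,
                                        int particle_cnt) {
    __shared__ double bestFitQueue[BLOCK_SIZE];
    __shared__ double bestPosQueue[BLOCK_SIZE];
    __shared__ unsigned num;

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x == 0) num = 0;          // reset the queue counter
    __syncthreads();

    // Lines 1-5: the rare conditional enqueue, serialized by a
    // shared-memory atomicAdd instead of a full reduction.
    if (idx < particle_cnt && fit[idx] > *gbest_fit) {
        unsigned qIdx = atomicAdd(&num, 1);
        bestFitQueue[qIdx] = fit[idx];
        bestPosQueue[qIdx] = pos[idx];
    }
    __syncthreads();

    // Lines 7-20: thread 0 scans the (usually empty) queue.
    if (idx < particle_cnt && threadIdx.x == 0) {
        auxFit[blockIdx.x] = -DBL_MAX;      // default: no candidate found
        if (num > 0) {
            for (unsigned j = 1; j < num; ++j)
                if (bestFitQueue[j] > bestFitQueue[0]) {
                    bestFitQueue[0] = bestFitQueue[j];
                    bestPosQueue[0] = bestPosQueue[j];
                }
            auxFit[blockIdx.x] = bestFitQueue[0];
            auxPos[blockIdx.x] = bestPosQueue[0];
        }
    }
}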

3.2. PPSO with the Shared Memory Queue and the Global Lock

As described in Section 3.1, the local best information of each block is stored in the corresponding position of the auxiliary arrays (lines 17-18 of Algorithm 2). This design allows the first thread of each block to store its best data to the auxiliary arrays concurrently with the other blocks. Later, the second kernel searches for the global best information within the auxiliary arrays. Such a design requires two kernels to compute the global best configuration, as illustrated in Figure 1. In order to alleviate the synchronization overhead between the two kernels and avoid the cost of launching the second kernel, we fuse the two kernels by computing the best configuration (originally done in the second kernel) at the end of the first kernel. That is, Algorithm 3 replaces lines 17-18 of Algorithm 2.

Algorithm 3 reduces the execution time because each thread block compares its local best configuration against the global best one as soon as its local best configuration is available. In the fused kernel, each thread block finds its local best configuration and updates the global configuration if necessary (a CUDA sketch is given after Algorithm 3). Such a design eliminates the need to write to the auxiliary arrays and relaxes the synchronization among the thread blocks (i.e., the requirement that the second kernel starts only after all threads of the first kernel have terminated); an atomic lock is introduced to ensure the global best information (gbest_fit and gbest_pos) is accessed sequentially. Therefore, it delivers better performance, especially for the one-dimensional case, where gbest_pos is a single value and the lock has no adverse side effects.

Algorithm 3 The queue algorithm with atomic lock.
1: while (atomicCAS(lock, 0, 1) != 0);
2: if bestFitQueue[0] > gbest_fit then
3:     gbest_fit ← bestFitQueue[0];
4:     gbest_pos ← bestPosQueue[0];
5:     __threadfence();
6: end if
7: atomicExch(lock, 0);
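
In CUDA, Algorithm 3 can be realized as a device function executed by thread 0 of each block once the block best is available in bestFitQueue[0] and bestPosQueue[0]; the function name and parameters below are our own sketch. Because only one thread per block spins on the lock, the usual intra-warp deadlock hazard of GPU spin locks does not arise here.

// Sketch of Algorithm 3: executed by threadIdx.x == 0 of each block only.
__device__ void update_global_best(int *lock, double *gbest_fit,
                                   double *gbest_pos,
                                   double blockFit, double blockPos) {
    while (atomicCAS(lock, 0, 1) != 0)   // acquire the global atomic lock
        ;                                // spin until another block releases it
    if (blockFit > *gbest_fit) {
        *gbest_fit = blockFit;
        *gbest_pos = blockPos;
        __threadfence();                 // make the update visible to all blocks
    }
    atomicExch(lock, 0);                 // release the lock
}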

4. Implementation Remarks

4.1. Coalesced Memory Access

In general, developers often use an array-of-structures (AoS) layout to design a program, since the AoS layout is easy to understand and supported directly by most programming languages. However, the AoS layout is almost the worst-case scenario for a parallel program such as a CUDA kernel. On the other hand, a structure-of-arrays (SoA) layout can reduce memory access time because the threads can perform coalesced accesses to contiguous memory addresses. Accordingly, we adopt the SoA layout in our implementation.

The listings below show how we implement the AoS and SoA layouts of the PSO algorithm. With the SoA adjustment, each field is stored in a fixed-size array, which guarantees that the threads of the same warp access consecutive values in global memory and thus obtain better performance than with the AoS layout. In the high-dimension case of the PSO algorithm, we follow the same SoA rule. As shown in Figure 2, all threads access the same dimension at a time, which still satisfies the condition of coalesced memory access and makes each element access efficient (a kernel sketch after Figure 2 illustrates the resulting indexing).

Listing 1. AoS: Array of structures.
struct Particle {
    double position;
    double velocity;
    double fitness;
    double pbest_pos;  // the local best-known position
    double pbest_fit;  // the local best-known fitness
} particles[N];

Listing 2. SoA: Structure of arrays.
struct ParticleCoalescedMem {
    double position[N];
    double velocity[N];
    double fitness[N];
    double pbest_pos[N];  // the local best-known position
    double pbest_fit[N];  // the local best-known fitness
};
Figure 2. The SoA layout of the particle positions in high dimensions with coalesced memory access.
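
The indexing implied by Figure 2 can be sketched as the kernel below: positions are stored dimension-major, so at every dimension d, consecutive threads (particles) touch consecutive global memory addresses. The kernel body is illustrative only.

__global__ void update_positions_soa(double *position, const double *velocity,
                                     int particle_cnt, int dim) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // particle index
    if (i >= particle_cnt) return;
    for (int d = 0; d < dim; ++d) {
        int k = d * particle_cnt + i;  // stride-1 across the threads of a warp
        position[k] += velocity[k];    // coalesced read-modify-write
    }
}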

4.2. Constant Memory

NVIDIA GPUs provide 64 KB of constant memory that is treated differently from standard global memory, as it has its own cache. Constant memory performs best when all threads access the same address simultaneously and threads in the same block re-access that address repeatedly. Thus, using constant memory instead of global memory conserves memory bandwidth. Although constant memory completes memory accesses with lower latency, it constrains the data to be read-only. In our case, we place the parameters w, c1, c2, max_pos, min_pos, max_v, min_v, and particle_cnt in constant memory so that all threads access them through the constant cache simultaneously, reducing the memory latency.
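
A minimal sketch of this placement is given below; the struct layout and names are our own choice, and the copy is issued once on the host before the iteration loop starts.

struct PsoParams {
    double w, c1, c2;
    double max_pos, min_pos, max_v, min_v;
    int particle_cnt;
};
__constant__ PsoParams d_params;   // cached and broadcast to all threads

// Host side: upload the read-only parameters once.
void upload_params(const PsoParams &h_params) {
    cudaMemcpyToSymbol(d_params, &h_params, sizeof(PsoParams));
}

// Device side: every thread reads the same address (e.g., d_params.w,
// d_params.c1), which is served by the constant cache.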

4.3. Shared Memory

The shared memory buffers reside physically on the GPU chip, as opposed to off-chip DRAM, and are therefore much faster than global memory. In fact, if there are no bank conflicts, the shared memory latency is about 100 times lower than the uncached global memory latency. Because shared memory is shared by the threads in a thread block, it provides a mechanism for threads to cooperate. As described in Section 2.2, the reduction accesses global memory frequently, so shared memory can be utilized to reduce the memory latency. Similarly, shared memory is also used in the GPU queue method proposed in Section 3.1. Since only 48 KB of shared memory is available per streaming multiprocessor (SM), in high-dimension cases we store the index of the particle rather than the positions of all axes. After the best fitness of a particle is determined, we use its index to copy the positions of all axes back, which not only saves latency but also reduces the shared memory usage to ensure maximum parallelism across the SMs.
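
The index trick can be sketched as the kernel fragment below, a variant of Algorithm 2 for the high-dimension case; the surrounding names (BLOCK_SIZE, gbest_fit_local, position, and the dimension-major auxPos layout) are our assumptions.

// Enqueue only the particle index (and fitness) instead of a dim-sized
// position vector; copy the winner's coordinates back from global
// memory afterwards.
__shared__ double   bestFitQueue[BLOCK_SIZE];
__shared__ unsigned bestIdxQueue[BLOCK_SIZE];
__shared__ unsigned num;

if (fit > gbest_fit_local) {
    unsigned q = atomicAdd(&num, 1);
    bestFitQueue[q] = fit;
    bestIdxQueue[q] = idx;              // 4 bytes, not 120 doubles
}
__syncthreads();

if (threadIdx.x == 0 && num > 0) {
    unsigned best = 0;
    for (unsigned j = 1; j < num; ++j)  // scan the short queue
        if (bestFitQueue[j] > bestFitQueue[best]) best = j;
    unsigned wIdx = bestIdxQueue[best];
    for (int d = 0; d < dim; ++d)       // fill the positions back by index
        auxPos[d * gridDim.x + blockIdx.x] = position[d * particle_cnt + wIdx];
    auxFit[blockIdx.x] = bestFitQueue[best];
}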

4.4. Random Number Generation

The PSO algorithm relies on a sequence of random numbers to search for the best solution within the designated space. To generate high-quality pseudorandom and quasirandom numbers on the GPU side, we use the cuRAND library (Corporation, 2010) offered by the CUDA toolkit. While users can provide their own random number generator on the GPU side by porting a CPU-side function, doing so is time-consuming and inefficient for GPU execution, since the custom implementation must be thread-safe and may not take advantage of the low-level primitives offered by the CUDA toolkit. Based on our experiments, the cuRAND library outperforms a custom implementation of the PPSO program by a factor of 1.1. Therefore, in our experiments, we adopt the cuRAND library (i.e., the curand_uniform_double() function) by default.
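
The usual cuRAND pattern is a per-thread generator state initialized once and reused in every iteration; a minimal sketch is shown below.

#include <curand_kernel.h>

// One curand_init per thread at startup; cheap draws afterwards.
__global__ void init_rng(curandState *states, unsigned long long seed) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    curand_init(seed, i, 0, &states[i]);   // seed, subsequence, offset
}

// Draw the r1 and r2 of Equation 1; curand_uniform_double() returns
// a double uniformly distributed in (0, 1].
__device__ void draw_r1_r2(curandState *s, double *r1, double *r2) {
    *r1 = curand_uniform_double(s);
    *r2 = curand_uniform_double(s);
}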

5. Experimental Results

5.1. Experimental Setup

The hardware and software platforms used for the experiments are listed in Table 2. Five PSO implementations are tested: 1) CPU, the serial version defined in Section 2.1; 2) Reduction, the parallel reduction method introduced in Section 2.2; 3) Loop Unrolling, the loop unrolling method introduced in Section 2.2; 4) Queue, the proposed queue algorithm defined in Section 3.1; and 5) Queue Lock, the proposed queue lock algorithm defined in Section 3.2. All the algorithms run on the GPU, except for the serial version. Double precision floating point numbers are adopted in all PSO implementations. Each reported number is the average execution time of 10 runs, with the maximum and minimum removed. To observe the impact of the improved algorithms, the running time on the GPU side is measured and reported. The implementations are used to solve problems in both low and high dimensional spaces, denoted as the 1D problem and the 120D problem, respectively.

Table 2. The hardware and software platforms for the experiments.
Name Description
CPU Intel Xeon CPU E3-1275 v5, 3.60GHz
GPU Nvidia GTX-1080ti, 1481 MHz, 3584 CUDA cores, Compute Capability 6.1
OS Ubuntu 18.04.4 LTS (kernel version 5.4.0-42-generic)
CUDA CUDA Toolkit 11.2

Fitness Function

An important design parameter of the PSO algorithm is the fitness function. While several functions are commonly used, such as Sphere, Rosenbrock, and Griewank, we choose the Cubic function shown in Equation 3 for all the experiments, since it has a slightly higher computational complexity (e.g., higher than Sphere). The inertia weight w is set to 1 and the learning factors c1 and c2 to 2, which are commonly used settings.

(3) f=i=1dxi30.8xi21000xi+8000,100xi100f=\sum_{i=1}^{d}x_{i}^{3}-0.8x_{i}^{2}-1000x_{i}+8000,\quad-100\leq x_{i}\leq 100

5.2. 1D Problem

Table 3. The execution times of the five implementations on the 1D problem.
Particles Iteration CPU (s) GPU Reduction (s) GPU Loop Unrolling (s) GPU Queue (s) GPU Queue Lock (s)
32 100,000 0.100 0.413 0.394 0.368 0.216
64 100,000 0.187 0.419 0.402 0.368 0.219
128 100,000 0.385 0.447 0.408 0.371 0.220
256 100,000 0.825 0.455 0.419 0.371 0.222
512 100,000 1.503 0.467 0.422 0.391 0.223
1,024 100,000 3.042 0.491 0.439 0.394 0.227
2,048 100,000 6.277 0.508 0.451 0.409 0.230

As listed in Table 3, the parallel algorithms running on the GPU outperform the serial version on the CPU. The execution time of the CPU version increases almost linearly with the computation load (related to the number of particles and iterations). In contrast, the parallel versions have relatively stable execution times across the different workloads, since the problem sizes are small enough to be covered by the GPU: each particle is assigned a GPU thread to handle the required computation, the maximum number of particles in this experiment is 2,048, and the Nvidia GPU offers 3,584 CUDA cores. The ranking of the five algorithms, in terms of execution time, is plotted in Figure 3. The Queue Lock algorithm clearly achieves the best performance among the parallel versions, and the Queue algorithm is the second best. In particular, the Queue Lock algorithm is 2.2 times faster than the Reduction algorithm, the state-of-the-art method. Table 4 shows the speedup of the Queue Lock algorithm on the GPU over the serial version on the CPU. A maximum speedup of 195x is reached when handling 65,536 particles, and the speedup drops at 131,072 particles. Based on our analysis, the GPU is overloaded in that case: its capacity is saturated and it cannot handle the workload efficiently (context switching among thread groups does not help the performance).

Figure 3. Execution times of the five implementations with different numbers of particles.
Table 4. The speedups achieved by the Queue Lock algorithm on the 1D problem.
Particles Iteration CPU (s) GPU Queue Lock (s) Speedup Ratio
128 100,000 0.385 0.220 1.75
256 100,000 0.825 0.222 3.71
512 100,000 1.503 0.223 6.73
1,024 100,000 3.042 0.227 13.40
2,048 100,000 6.277 0.230 27.29
4,096 100,000 12.410 0.265 46.83
8,192 100,000 23.850 0.316 75.47
16,384 100,000 47.355 0.417 113.56
32,768 100,000 94.629 0.643 147.16
65,536 100,000 200.536 1.026 195.45
131,072 100,000 378.671 2.759 137.24

5.3. 120D Problem

We configure the PSO algorithms to handle a problem defined in a 120-dimensional space. In this experiment, we compare the Queue algorithm against the serial version, since the time saved by the Queue Lock algorithm is negligible in the high-dimension space: the execution of the first kernel dominates the total execution time, and the synchronization overhead saved by the Queue Lock is relatively small. Hence, we consider the Queue algorithm the better solution for the high-dimension problem. The speedups achieved by the Queue algorithm are listed in Table 5. Because of the higher-dimensional space, the peak speedup (225x) is achieved when the number of particles is 32,768, which is only half of the number (65,536) reported in Table 4.

Table 5. The speedups achieved by the Queue algorithm on the 120D problem.
Particles Iteration CPU (s) GPU Queue (s) Speedup Ratio
128 5,000 2.392 0.487 4.91
256 4,000 3.543 0.384 9.22
512 3,000 5.305 0.288 18.42
1,024 2,000 7.078 0.225 31.45
2,048 2,000 14.214 0.255 55.74
4,096 1,500 21.593 0.220 98.15
8,192 1,000 29.494 0.191 154.41
16,384 1,000 59.125 0.294 201.10
32,768 1,000 128.349 0.570 225.17
65,536 1,000 237.933 1.169 203.53
131,072 800 379.820 1.744 217.78

6. Conclusion and Future Work

In this work, we propose queue-based algorithms to shorten the execution time of solving PSO problems and present the key ideas of parallelizing the algorithms for GPUs. Our experimental results show that the proposed algorithms effectively improve the execution efficiency of solving PSO problems, achieving a 200x speedup compared with the serial version and a 2.2x speedup compared with the parallel reduction based method. Encouraged by these results, in the future we would like to further improve the performance of the queue-based approach with an asynchronous execution scheme and to extend the algorithm to multiple GPUs so as to handle larger PSO problems.

References

  • PSO (2001) 2001. Appendix A - Statistics for Swarmers. In Swarm Intelligence, James Kennedy, Russell C. Eberhart, and Yuhui Shi (Eds.). Morgan Kaufmann, San Francisco, 429–449. https://doi.org/10.1016/B978-155860595-4/50012-7
  • CHEN Feng (2014) CHEN Feng, TIAN Yu-bo, and YANG Min. 2014. Research and Design of Parallel Particle Swarm Optimization Algorithm Based on CUDA. (2014). http://www.jsjkx.com/CN/article/openArticlePDF.jsp?id=1649
  • Corporation (2010) NVIDIA Corporation. 2010. NVIDIA. CUDA CURAND Library. (2010). https://www.cs.cmu.edu/afs/cs/academic/class/15668-s11/www/cuda-doc/CURAND_Library.pdf
  • Harris ([n. d.]) Mark Harris. [n. d.]. Optimizing Parallel Reduction in CUDA. ([n. d.]). https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
  • Kennedy and Eberhart (1995) J. Kennedy and R. Eberhart. 1995. Particle swarm optimization. In Proceedings of ICNN’95 - International Conference on Neural Networks, Vol. 4. 1942–1948 vol.4. https://doi.org/10.1109/ICNN.1995.488968
  • Krink and Løvbjerg (2002) Thiemo Krink and Morten Løvbjerg. 2002. The LifeCycle Model: Combining Particle Swarm Optimisation, Genetic Algorithms and HillClimbers. In Parallel Problem Solving from Nature — PPSN VII, Juan Julián Merelo Guervós, Panagiotis Adamidis, Hans-Georg Beyer, Hans-Paul Schwefel, and José-Luis Fernández-Villacañas (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 621–630.
  • Niknam and Amiri (2010) Taher Niknam and Babak Amiri. 2010. An efficient hybrid approach based on PSO, ACO and k-means for cluster analysis. Applied Soft Computing 10, 1 (2010), 183–197. https://doi.org/10.1016/j.asoc.2009.07.001
  • Poli (2008) Riccardo Poli. 2008. Analysis of the Publications on the Applications of Particle Swarm Optimisation. Journal of Artificial Evolution and Applications 2008 (02 2008), 10. https://doi.org/10.1155/2008/685175
  • Venayagamoorthy and Gudise (2004) Ganesh Venayagamoorthy and V.G. Gudise. 2004. Swarm intelligence for digital circuits implementation on field programmable gate arrays platforms. 83–86. https://doi.org/10.1109/EH.2004.1310813
  • Veronese and Krohling (2009) Lucas de P. Veronese and Renato A. Krohling. 2009. Swarm’s flight: Accelerating the particles using C-CUDA. In 2009 IEEE Congress on Evolutionary Computation. 3264–3270. https://doi.org/10.1109/CEC.2009.4983358
  • Wang et al. (2008) Dazhi Wang, Chun-Ho Wu, Andrew Ip, Dingwei Wang, and Yang Yan. 2008. Parallel multi-population Particle Swarm Optimization Algorithm for the Uncapacitated Facility Location problem using OpenMP. In 2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence). 1214–1218. https://doi.org/10.1109/CEC.2008.4630951
  • Yamada et al. (2010) Takahiro Yamada, Yutaka Maeda, Seiji Miyoshi, and Hiroomi Hikawa. 2010. Simultaneous perturbation particle swarm optimization and FPGA realization. In Proceedings of SICE Annual Conference 2010. 1462–1465.
  • Zhang and Xie (2003) Wen-Jun Zhang and Xiao-Feng Xie. 2003. DEPSO: hybrid particle swarm with differential evolution operator. In SMC’03 Conference Proceedings. 2003 IEEE International Conference on Systems, Man and Cybernetics. Conference Theme - System Security and Assurance (Cat. No.03CH37483), Vol. 4. 3816–3821 vol.4. https://doi.org/10.1109/ICSMC.2003.1244483