Support Vector Machine Implementation on MPI-CUDA and Tensorflow Framework
Abstract
The Support Vector Machine (SVM) algorithm requires a high computational cost (both in memory and time) to solve a complex quadratic programming (QP) optimization problem during the training process. Consequently, SVM demands substantial computing hardware capabilities. The central processing unit (CPU) clock frequency cannot be increased due to physical limitations in the miniaturization process. However, the parallelism available in both multi-core CPUs and highly scalable GPUs emerges as a promising way to enhance algorithm performance, and therefore offers an opportunity to reduce the high computational time required by SVM to solve the QP optimization problem. This paper presents a comparative study that implements the SVM algorithm on different parallel architecture frameworks. The experimental results show that the SVM MPI-CUDA implementation achieves a speedup over the SVM TensorFlow implementation on different datasets. Moreover, the SVM TensorFlow implementation provides a cross-platform solution that can be migrated to alternative hardware components, which reduces development time.
Index Terms:
Machine Learning (ML), Support Vector Machine (SVM), quadratic programming (QP), central processing unit (CPU), message passing interface (MPI), and compute unified device architecture (CUDA).
I Introduction
Recently, machine learning (ML) has been applied in many different fields of our lives. ML aims to allow machines to learn as humans do. Many different algorithms manipulate offline dataset information to extract knowledge, enabling machines to learn how to predict new information. ML algorithms aim to learn a function that maps input data x onto an output y. The core of any ML algorithm is to learn a function that allows it to identify new input x and map it to its output y.
Support Vector Machine (SVM) is one of the core algorithms in ML. SVM is widely used in both classification and regression. SVM is a binary classifier; one-against-one and one-against-all are two approaches to extending it to multiclass classification. SVM requires high computational time and intensive memory and storage to solve large-scale problems. However, the performance of the central processing unit (CPU) cannot be increased further due to clock-frequency limitations. Therefore, significant improvements have been achieved through highly scalable graphics processing unit (GPU) parallel architectures and multi-core processors. These improvements provide a chance to overcome problems that require high computational cost [1, 2, 3, 4].
Message Passing Interface (MPI) is a standardized means of communication between different nodes across distributed memory. Compute Unified Device Architecture (CUDA) from NVIDIA is a parallel computing platform and programming model developed to leverage the power of the GPU. A hybrid model combining MPI and CUDA can be used to make the best use of the hardware capabilities [4, 5]. TensorFlow is an open-source machine learning library that provides application programming interfaces (APIs) for implementing ML algorithms on a variety of platforms (CPU and GPU) for desktop, mobile, web, and cloud [6].
The rest of this paper is organized as follows. Section II introduces the hybrid message passing interface - compute unified device architecture (MPI-CUDA) model and the TensorFlow framework. An introduction to the SVM algorithm and our proposed technique design is presented and discussed in Section III. Finally, the experimental results and datasets are described in Section IV, and the conclusion is provided in Section V.
II Parallel Architecture
II-A Message Passing Interface - Compute Unified Device Architecture (MPI-CUDA)
MPI is a standard interface for communication between different nodes in a distributed-memory system, where each node has its own local address space and runs independently. MPI provides node-to-node communication and collective communication through the communication network between the nodes. The MPI programming model follows the Single Program Multiple Data (SPMD) model, in which a set of nodes executes the same instructions on different data. The same program is loaded onto all nodes, but each node can follow a different execution path. There are many implementations that follow the MPI standard; the most popular of them are MPICH and LAM/MPI [5].
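As a minimal illustration of the SPMD model, the sketch below uses mpi4py (a Python MPI binding chosen here purely for illustration; the paper's own implementation uses MPICH2): every rank runs the same script, works on its own slice of the data, and the root gathers the results.

```python
# Minimal SPMD sketch with mpi4py (illustrative only; assumes the number of
# ranks divides the data size, e.g. run with: mpiexec -n 4 python spmd_demo.py).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n = 8
data = np.arange(n, dtype='f') if rank == 0 else None   # data starts on the root node
local = np.empty(n // size, dtype='f')

comm.Scatter(data, local, root=0)   # collective communication: distribute chunks
local *= (rank + 1)                 # each node follows its own execution path

result = np.empty(n, dtype='f') if rank == 0 else None
comm.Gather(local, result, root=0)  # collective communication: collect results
if rank == 0:
    print(result)
```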
CUDA is an API for leveraging the GPU parallel architecture for general-purpose computing on GPUs (GPGPU). NVIDIA introduced the CUDA API for the first time in November 2006 to overcome the limitations of earlier GPGPU approaches, which required knowledge of and interaction with graphics APIs such as DirectX or OpenGL. In the CUDA programming model, the GPU can be considered a massively parallel Single Instruction Multiple Data (SIMD) array of Streaming Multiprocessors (SMs). A grid of blocks of threads is assigned to execute a kernel [7].
A kernel is a program executed by each thread; using its thread identifier and block identifier, each thread can perform the kernel task on a different portion of the data in parallel. The CPU, or 'host', is responsible for launching the kernel to be executed on the GPU, or 'device'. It also sends the input kernel data from its memory ('host memory') to the GPU ('device memory') and then copies the result back from device memory to host memory. Each SM has its own shared memory, which is common to all the processors inside it, as well as constant memory caches. Communication between different SMs occurs through device memory, which is accessible to all SMs [7, 8, 9].
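To make the kernel-launch model concrete, the sketch below uses Numba's CUDA bindings from Python (an illustrative assumption; the paper's implementation targets the CUDA C API). It shows a thread computing its global index from its block and thread identifiers, and the host moving data to and from device memory around the kernel launch.

```python
# Minimal CUDA kernel-launch sketch with Numba (illustrative only).
import numpy as np
from numba import cuda

@cuda.jit
def scale_kernel(vec, factor):
    # Global index computed from the block identifier and thread identifier.
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    if i < vec.shape[0]:           # guard against out-of-range threads
        vec[i] *= factor           # each thread handles one element in parallel

host_data = np.arange(1024, dtype=np.float32)
device_data = cuda.to_device(host_data)                    # host memory -> device memory
threads_per_block = 256
blocks = (host_data.size + threads_per_block - 1) // threads_per_block
scale_kernel[blocks, threads_per_block](device_data, 2.0)  # host launches kernel on device
result = device_data.copy_to_host()                        # device memory -> host memory
```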
MPI represents a distributed-memory architecture, while CUDA represents a shared-memory architecture. The hybrid memory architecture combines both: each node has its own multicore processor that is utilized under a shared-memory model, while the different nodes together form a distributed-memory architecture, communicating through a network and using various communication directives to exchange messages among them [10, 11]. This is illustrated in Fig. 1.

II-B TensorFlow Framework
Recently, many efficient and easy-to-use ML frameworks have been introduced to accommodate the increasing complexity of ML algorithms. They allow both beginners and experts to develop their own ML techniques. TensorFlow is one such ML framework: an open-source ML library interface for expressing ML algorithms and executing predefined ML algorithms.
TensorFlow is used for high-performance numerical computation. Its flexible architecture allows easy implementation and deployment across platforms (desktops, clusters of servers, CPUs, GPUs, TPUs, and mobile devices), and it can run on several operating systems [6]. The name TensorFlow combines tensor, the data, and flow, the movement of this data through a dataflow graph. A dataflow graph is a directed graph consisting of a set of nodes and the edges between them, where the nodes describe instructions (operations) and the edges represent the data flow [6]. A TensorFlow implementation consists of two steps: the first constructs a dataflow graph, and the second uses a session to run the instructions in this graph and evaluate the output result. The session is the runtime environment of a graph, where instructions are executed and tensors are calculated, as shown in Fig. 2.
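As a minimal illustration of this two-step workflow, the sketch below uses the TensorFlow 1.x Python API (the operation names and values are ours, chosen only for demonstration):

```python
import tensorflow as tf

# Step 1: construct the dataflow graph (nodes = operations, edges = tensors).
a = tf.placeholder(tf.float32, name='a')       # tensor that will always be fed
b = tf.placeholder(tf.float32, name='b')
w = tf.Variable(2.0, name='w')                 # stored parameter, no computation
result = tf.add(tf.multiply(w, a), b, name='result')

# Step 2: execute the graph inside a session, the runtime environment.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(result, feed_dict={a: 3.0, b: 1.0}))   # 2*3 + 1 = 7.0
```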

III Support Vector Machine Algorithm and Proposed Technique Design
III-A Sequential SVM
SVM is a binary-classifier ML algorithm. It is given a set of training data samples $\{(x_i, y_i)\}_{i=1}^{n}$, where $n$ is the number of samples, $x_i \in \mathbb{R}^d$ is an input sample with $d$ features, and $y_i \in \{-1, +1\}$ is the class label of input sample $x_i$. Assuming the classes are linearly separable, SVM aims to find the hyperplane that gives the maximum margin between the two classes of data in $\mathbb{R}^d$. "Support vectors" are the vectors of both classes located on the margin hyperplanes [12]. In most cases, the training data samples are not linearly separable, so such a hyperplane does not exist; using a kernel function allows SVM to handle non-linearly separable classes [13]. Although SVM is a binary classification technique, it can be used as a multi-class classifier. The "one-against-one" method is more suitable for practical use than other methods: for $k$ classes, it solves $k(k-1)/2$ independent binary classification problems [14].
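For reference, the soft-margin kernel SVM training problem can be written in its standard dual form (a textbook formulation, not quoted from the paper):

```latex
\begin{aligned}
\max_{\alpha}\quad & \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    \alpha_i \alpha_j \, y_i y_j \, K(x_i, x_j) \\
\text{s.t.}\quad & 0 \le \alpha_i \le C, \quad i = 1, \dots, n, \\
 & \sum_{i=1}^{n} \alpha_i y_i = 0,
\end{aligned}
```

with the resulting decision function $f(x) = \operatorname{sign}\bigl(\sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b\bigr)$, where $C$ is the regularization parameter and $K(\cdot,\cdot)$ is the kernel function.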
III-B MPI-CUDA SVM
The sequential minimal optimization (SMO) algorithm is used to solve the SVM binary training quadratic programming problem, where the training data samples are divided into smaller working sets. At each step, SMO solves the smallest possible subproblem: it selects only two optimization variables and analytically updates the corresponding two Lagrange multipliers under the Karush-Kuhn-Tucker (KKT) constraints [15, 16, 17].
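For the selected pair of multipliers $(\alpha_i, \alpha_j)$, the standard analytic update takes the following form (the notation here is ours, following the SMO literature [15, 17]):

```latex
\begin{aligned}
\eta &= K(x_i, x_i) + K(x_j, x_j) - 2 K(x_i, x_j), \\
\alpha_j^{\text{new}} &= \operatorname{clip}\!\left(\alpha_j + \frac{y_j \,(E_i - E_j)}{\eta},\; L,\; H\right), \\
\alpha_i^{\text{new}} &= \alpha_i + y_i y_j \left(\alpha_j - \alpha_j^{\text{new}}\right),
\end{aligned}
```

where $E_k = f(x_k) - y_k$ is the prediction error of sample $k$ and $[L, H]$ are the bounds implied by the box constraint $0 \le \alpha \le C$ together with the equality constraint $\sum_i \alpha_i y_i = 0$.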
In [18, 19], SMO is implemented in parallel, taking full advantage of the huge computational power of GPU devices. In the MPI-CUDA implementation [20], we launch one thread per training data sample, so most of the SMO computation steps run in parallel on the device, and convergence checks are executed on the host after every set of iterations on the device, as shown in Fig. 3. Moreover, Fig. 4 illustrates the use of hybrid MPI-CUDA to execute the multi-class training problem: multiple parallel binary SMOs are run to implement a parallel multi-class SMO, and these binary SMO problems are distributed among the MPI working nodes [20], as sketched below.
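A minimal sketch of this distribution step, written with mpi4py for illustration (the names are assumptions, not the authors' code; `train_binary_smo_on_gpu` is a hypothetical stand-in for the CUDA SMO solver):

```python
from itertools import combinations
from mpi4py import MPI

def train_binary_smo_on_gpu(class_a, class_b):
    """Hypothetical stand-in for the CUDA binary SMO solver (one GPU thread per
    training sample, periodic convergence checks on the host)."""
    return {'classes': (class_a, class_b)}   # placeholder for multipliers and bias

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

num_classes = 9                                        # e.g. the Pavia Centre dataset
pairs = list(combinations(range(num_classes), 2))      # k(k-1)/2 one-against-one problems
my_pairs = pairs[rank::size]                           # round-robin split over MPI nodes

local_models = {pair: train_binary_smo_on_gpu(*pair) for pair in my_pairs}
all_models = comm.gather(local_models, root=0)         # root collects all binary models
```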


III-C TensorFlow SVM
As with many algorithms implemented in TensorFlow, SVM is described as a directed graph consisting of nodes and edges, where the nodes represent instructions (operations) and the edges represent data flow. A 'Variable' is a special instruction (operation) without computation; it is used only for storing the parameters of the described algorithm. A 'Placeholder' is a tensor that will always be fed with input data at run time. After graph construction, a TensorFlow session is used to execute the graph and compute the result [21, 22].
For the SVM binary training problem, the directed graph is built in three steps. The first is feeding the input training data samples through 'Placeholders'. The second is defining the SVM 'Variables' and describing the Gaussian RBF kernel function. The third is describing the SVM model and declaring the gradient descent optimizer, as shown in Fig. 5. After the graph for the SVM binary training process has been described, a TensorFlow session is used to run it and compute the output result. Multiple running sessions are used to implement the SVM multiclass training process with the 'one-against-one' approach; a sketch of the graph construction is given below.
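The following sketch follows those three steps using the TensorFlow 1.x Python API (a kernelized dual-form SVM trained with gradient descent; the variable names, data sizes, learning rate, and gamma value are illustrative assumptions, not the paper's settings):

```python
import numpy as np
import tensorflow as tf

n, d = 200, 4                                   # assumed training-set size and feature count

# Step 1: placeholders for the training data (labels in {-1, +1}).
x_data = tf.placeholder(tf.float32, shape=[n, d])
y_target = tf.placeholder(tf.float32, shape=[n, 1])

# Step 2: variables (dual coefficients) and the Gaussian RBF kernel matrix.
alpha = tf.Variable(tf.random_normal(shape=[1, n]))
gamma = tf.constant(-10.0)                      # illustrative (negated) RBF width
sq_norms = tf.reshape(tf.reduce_sum(tf.square(x_data), 1), [-1, 1])
sq_dists = sq_norms - 2.0 * tf.matmul(x_data, tf.transpose(x_data)) + tf.transpose(sq_norms)
kernel = tf.exp(gamma * tf.abs(sq_dists))

# Step 3: dual-form SVM objective and gradient descent optimizer.
first_term = tf.reduce_sum(alpha)
alpha_cross = tf.matmul(tf.transpose(alpha), alpha)
y_cross = tf.matmul(y_target, tf.transpose(y_target))
second_term = tf.reduce_sum(kernel * alpha_cross * y_cross)
loss = tf.negative(first_term - 0.5 * second_term)   # maximize dual = minimize its negative
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

# Run the graph in a session (random toy data used here for illustration).
X = np.random.randn(n, d).astype(np.float32)
Y = np.sign(np.random.randn(n, 1)).astype(np.float32)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(300):
        sess.run(train_step, feed_dict={x_data: X, y_target: Y})
```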

IV Datasets & Experimental Results
In this section, we present the datasets used to compare the SVM implementations on CUDA and TensorFlow and report the experimental results. The results highlight the performance advantage of the CUDA implementation over the TensorFlow implementation, and they also demonstrate the cross-platform nature of the TensorFlow implementation, which can be migrated to alternative hardware components.
IV-A Datasets
Table I shows the experimental datasets. The first dataset is the Pavia Centre hyperspectral dataset, consisting of 1096x715 pixels and nine ground-truth classes: water, trees, grass, parking lot, bare soil, asphalt, bitumen, tiles, and shadow. The second dataset is the Iris Flower multivariate dataset with 150 total data samples and 3 classes (setosa, virginica, versicolor). The third dataset is the Breast Cancer Wisconsin dataset with 569 data samples and 2 possible classes (benign, malignant).
TABLE I: Experimental datasets
| Dataset | Description | #Classes | #Features |
| --- | --- | --- | --- |
| Pavia Centre | Hyperspectral image, 1096x715 pixels | 9 | 102 spectral bands |
| Iris Flower | 150 data samples | 3 | 4 features |
| Breast Cancer | 569 data samples | 2 | 32 features |
IV-B Experimental Results
Table II shows the hardware specifications of the machine used to carry out these experiments. CUDA version 9.0 and MPICH2 are used to implement SVM MPI-CUDA, while SVM TensorFlow is implemented using TensorFlow version 1.8.0. Both implementations run on Windows 10 (64-bit).
For the Pavia Centre dataset, Table III and Fig. 6 present the binary training times of a binary-class SVM at various sample sizes, comparing the CUDA-GPU implementation against the TensorFlow-GPU implementation. Specifically, CUDA-GPU achieved a speedup of 154.3x over TensorFlow-GPU with 800 sample points per class. Similarly, Table IV and Fig. 7 present the training times for multiclass SVM at different sample sizes using the hybrid MPI-CUDA implementation versus Multi-TensorFlow on the same dataset; MPI-CUDA achieved a speedup of 14.9x over Multi-TensorFlow with 800 sample points per class.
TABLE II: Hardware specifications of the experimental machine
| Component | Specification |
| --- | --- |
| Host (CPU) | Core i7-7500M @ 2.70 GHz (2.9 GHz), 16 GB RAM |
| Device (GPU) | NVIDIA GeForce GTX 950M, 5 SMs, 640 cores |
TABLE III: Binary SVM training time on the Pavia Centre dataset
| Pavia dataset (#training samples / #classes) | CUDA-GPU (s) | TensorFlow-GPU (s) | Speedup |
| --- | --- | --- | --- |
| 200/2 | 0.017667 | 2.0345 | 115.2x |
| 400/2 | 0.019695 | 2.43 | 123.4x |
| 600/2 | 0.02487 | 3.09 | 124.2x |
| 800/2 | 0.02797 | 4.315 | 154.3x |


TABLE IV: Multi-class SVM training time on the Pavia Centre dataset
| Pavia dataset (#training samples / #classes) | MPI-CUDA (s) | Multi-TensorFlow (s) | Speedup |
| --- | --- | --- | --- |
| 200/9 | 8.4855 | 82.762 | 9.8x |
| 400/9 | 9.13105 | 96.72 | 10.6x |
| 600/9 | 9.6268 | 120.32 | 12.5x |
| 800/9 | 10.688 | 157.97 | 14.9x |
TABLE V: Binary SVM training time on the Iris Flower and Breast Cancer datasets
| Dataset (#datapoints / #features / #classes) | CUDA-GPU (s) | TensorFlow-GPU (s) | Speedup |
| --- | --- | --- | --- |
| Iris Flower (40/4/2) | 0.018 | 1.125 | 60.5x |
| Breast Cancer (190/32/2) | 0.0233 | 2.746 | 117.9x |
TABLE VI: TensorFlow training time on CPU versus GPU
| Dataset (#datapoints / #features / #classes) | TensorFlow-CPU (s) | TensorFlow-GPU (s) |
| --- | --- | --- |
| Iris Flower (40/4/2) | 3.09 | 1.125 |
| Breast Cancer (190/32/2) | 4.65 | 2.746 |
Table V shows the training times for the Iris Flower dataset with 2 classes, where CUDA-GPU achieved a speedup of 60.5x over TensorFlow-GPU; for the Breast Cancer dataset, CUDA-GPU achieved a speedup of 117.9x. Overall, the results indicate that the speedup grows with the number of training data points. Using the CUDA API provides explicit control over the number of worker streaming multiprocessors (SMs), the threads per block, and memory management, as opposed to the implicit control in TensorFlow. Owing to this explicit control, the SVM CUDA implementation demonstrates a speedup over the SVM TensorFlow implementation on the tested datasets.
Furthermore, the MPI communication overhead between nodes in the MPI-CUDA implementation may influence the overall performance, but communication is only required to transfer the input data at the beginning of execution and to send back the result data at the end of execution. No communication is needed during execution, so the overhead on system performance is small. Despite this MPI communication, the SVM MPI-CUDA implementation achieved a speedup over the Multi-TensorFlow implementation on all tested datasets.
Finally, Table VI shows the training time for the Iris Flower dataset and the Breast Cancer dataset on TensorFlow-CPU and TensorFlow-GPU. No change is required to run the same TensorFlow implementation on both CPU and GPU, as TensorFlow's flexible architecture allows easy deployment across different platforms.
V Conclusion
SVM is one of the most popular classification algorithms in the ML field. Many frameworks can be used to implement SVM; in this paper, we compared two of them, CUDA and TensorFlow, on three different datasets (Pavia Centre, Iris Flower, Breast Cancer). The comparison between CUDA-GPU and TensorFlow-GPU for binary-class SVM shows that training on CUDA-GPU achieved a speedup of 154.3x over TensorFlow-GPU for the Pavia Centre dataset with 800 sample datapoints per class and 102 features. CUDA-GPU also reached a speedup of 60.5x for the Iris Flower dataset with 40 sample datapoints per class and 4 features, and the binary CUDA implementation reached a speedup of 117.9x over the TensorFlow implementation on the Breast Cancer dataset. For multi-class SVM, the MPI-CUDA implementation achieved a speedup of 14.9x over Multi-TensorFlow-GPU on the Pavia Centre dataset with 800 sample datapoints per class, 102 features, and 9 classes. The experimental results on the three datasets show that the explicit control offered by the CUDA API provides a speedup over the implicit control in TensorFlow. However, TensorFlow reduces development time and can be migrated to alternative hardware components.
References
- [1] Michael J Flynn. Some computer organizations and their effectiveness. IEEE transactions on computers, 100(9):948–960, 1972.
- [2] Jian Cao, Minglu Li, Min-You Wu, and Jinjun Chen. Network and Parallel Computing: IFIP International Conference, NPC 2008, Shanghai, China, October 18-20, 2008, Proceedings, volume 5245. Springer, 2008.
- [3] Mohammad Osman Tokhi, M Alamgir Hossain, and M Hasan Shaheed. Parallel computing for real-time signal processing and control. Springer Science & Business Media, 2003.
- [4] Ron Bekkerman, Mikhail Bilenko, and John Langford. Scaling up machine learning: Parallel and distributed approaches. Cambridge University Press, 2011.
- [5] Yukiya Aoyama, Jun Nakano, et al. RS/6000 SP: Practical MPI Programming. IBM Poughkeepsie, New York, 1999.
- [6] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: a system for Large-Scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16), pages 265–283, 2016.
- [7] Shane Cook. CUDA programming: a developer’s guide to parallel computing with GPUs. Newnes, 2012.
- [8] NVIDIA. NVIDIA CUDA C programming guide, version 4.2. NVIDIA: Santa Clara, CA, 2010.
- [9] Jason Sanders and Edward Kandrot. CUDA by example: an introduction to general-purpose GPU programming. Addison-Wesley Professional, 2010.
- [10] Heba Khaled, Hossam El Deen Mostafa Faheem, and Rania El Gohary. Design and implementation of a hybrid MPI-CUDA model for the Smith-Waterman algorithm. International Journal of Data Mining and Bioinformatics, 12(3):313–327, 2015.
- [11] Qing-kui Chen and Jia-kang Zhang. A stream processor cluster architecture model with the hybrid technology of MPI and CUDA. In 2009 First International Conference on Information Science and Engineering, pages 86–89. IEEE, 2009.
- [12] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1–27, 2011.
- [13] Bryan Catanzaro, Narayanan Sundaram, and Kurt Keutzer. Fast support vector machine training and classification on graphics processors. In Proceedings of the 25th international conference on Machine learning, pages 104–111, 2008.
- [14] Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multiclass support vector machines. IEEE transactions on Neural Networks, 13(2):415–425, 2002.
- [15] Bernhard Schölkopf, Christopher JC Burges, and Alexander J Smola. Advances in kernel methods: support vector learning. MIT press, 1999.
- [16] Rong-En Fan, Pai-Hsuen Chen, Chih-Jen Lin, and Thorsten Joachims. Working set selection using second order information for training support vector machines. Journal of machine learning research, 6(12), 2005.
- [17] S. Sathiya Keerthi, Shirish Krishnaj Shevade, Chiranjib Bhattacharyya, and Karuturi Radha Krishna Murthy. Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation, 13(3):637–649, 2001.
- [18] Noel Lopes and Bernardete Ribeiro. GPU machine learning library (GPUMLib). In Machine Learning for Adaptive Many-Core Machines - A Practical Approach, pages 15–36, 2015.
- [19] Kun Tan, Junpeng Zhang, Qian Du, and Xuesong Wang. GPU parallel implementation of support vector machines for hyperspectral image classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 8(10):4647–4656, 2015.
- [20] I. Elgarhy, Rania El Gohary, and H. M. Faheem. Multi-class support vector machine training and classification based on MPI-GPU hybrid parallel architecture. In Proceedings of the International Conference on Advanced Intelligent Systems and Informatics 2018, pages 179–188. Springer, 2019.
- [21] Peter Goldsborough. A tour of TensorFlow. arXiv preprint arXiv:1610.01178, 2016.
- [22] Yu Yuan, Kushal Virupakshappa, Yiyue Jiang, and Erdal Oruklu. Comparison of GPU and FPGA based hardware platforms for ultrasonic flaw detection using support vector machines. In 2017 IEEE International Ultrasonics Symposium (IUS), pages 1–4. IEEE, 2017.