Support Vector Machines (SVM) is a supervised Machine Learning and Data Mining (MLDM) algorithm, which has become ubiquitous largely due to its high accuracy and obliviousness to dimensionality. The objective of SVM is to find an optimal boundary --- also known as hyperplane --- which separates the samples (examples in a dataset) of different classes by a maximum margin. Usually, very few samples contribute to the definition of the boundary. However, existing parallel algorithms use the entire dataset for finding the boundary, which is sub-optimal for performance reasons. In this paper, we propose a novel distributed memory algorithm to eliminate the samples which do not contribute to the boundary definition in SVM. We propose several heuristics, which range from early (aggressive) to late (conservative) elimination of the samples, such that the overall time for generating the boundary is reduced considerably. In a few cases, a sample may be eliminated (shrunk) pre-emptively --- potentially resulting in an incorrect boundary. We propose a scalable approach to synchronize the necessary data structures such that the proposed algorithm maintains its accuracy. We consider the necessary trade-offs of single/multiple synchronization using in-depth time-space complexity analysis. We implement the proposed algorithm using MPI and compare it with libsvm--- de facto sequential SVM software --- which we enhance with OpenMP for multi-core/many-core parallelism. Our proposed approach shows excellent efficiency using up to 4096 processes on several large datasets such as UCI HIGGS Boson dataset and Offending URL dataset.
Revised: January 21, 2016 |
Published: September 8, 2015
Citation
Vishnu A., J. Narasimhan, L. Holder, D.J. Kerbyson, and A. Hoisie. 2015.Fast and Accurate Support Vector Machines on Large Scale Systems. In IEEE International Conference on Cluster Computing (CLUSTER 2015), September 8-11, 2015, Chicago, Illinois, 110-119. Piscataway, New Jersey:IEEE.PNNL-SA-110940.doi:10.1109/CLUSTER.2015.26