April 26, 2025
Conference Paper

Performance Analysis of Data Processing in Distributed File Systems with Near Data Processing

Abstract

In the era of big data, the escalating volume and velocity of data generation pose significant challenges for data processing. Traditional systems such as Spark and Hadoop cope with this growth by improving data placement and processing speed. However, they remain inherently limited by the data movement that processing requires. In this paper, we explore the Skyhook framework, an extension of the Ceph distributed storage system that significantly reduces the need for data movement. We present an extensive case study that applies Skyhook to the TPC-H benchmark and the K-means clustering algorithm. Specifically, we use the TPC-H benchmark to distinguish between CPU-intensive and I/O-intensive tasks. We also explore integrating K-means clustering into SQL, coupled with a near-data processing system that offloads the computational burden of the algorithm to storage nodes. We conduct a comprehensive performance evaluation of distributed data processing applications across three processing approaches: traditional layout (baseline), optimized layout, and near-data processing. Additionally, we use the FIO tool to simulate real-world system workloads, which enables the measurement of performance metrics such as average latency and CPU utilization. Our results advance the understanding of how to optimize data processing systems to meet the demands of the modern data landscape.

Published: April 26, 2025

Citation

Hou S., N.R. Tallent, W. Li, and N. Mi. 2024. Performance Analysis of Data Processing in Distributed File Systems with Near Data Processing. In International Symposium on Networks, Computers and Communications (ISNCC 2024), October 22-25, 2024, Washington, D.C., 1-6. Piscataway, New Jersey: IEEE. PNNL-SA-202854. doi:10.1109/ISNCC62547.2024.10758994