May 15, 2025
Conference Paper

Improving I/O-aware Workflow Scheduling via Data Flow Characterization and trade-off Analysis

Abstract

The scientific computing paradigm has transitioned from compute-intensive to I/O-intensive and memory-intensive in the past decade, especially as data-driven science has become common practice. Numerous empirical I/O-aware scheduling optimizations have been developed by incorporating I/O capacity and bandwidth as constraints into scheduling. Unfortunately, there is a lack of data flow (I/O) characterization tools and of understanding of the trade-offs between concurrency, locality, and I/O bandwidth. To bridge the gap, this work 1) presents a set of descriptors to characterize, organize, and visualize I/O profiles, including flow size, I/O bandwidth, and operation count, which group data flows by I/O types, tasks, and files; and 2) proposes an I/O Roofline model-based trade-off analysis to find the optimal trade-off between flow operational intensity, concurrency, and flow performance. The I/O descriptors generate useful insights into complicated I/O behaviors, suggesting that distinct concurrency, storage, and scheduling choices be used for different I/O types, tasks, and files. The proposed trade-off analysis guides scheduling decisions that generate resource assignments with the best flow parallelism. We evaluate our I/O-aware scheduling methodology on a highly I/O-intensive workflow, 1000 Genomes. The experimental results demonstrate speedups of up to 2.4× compared to state-of-the-art methods.
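As a rough illustration of the descriptors named in the abstract, the following minimal sketch groups traced I/O operations by file, task, or I/O type and reports flow size, operation count, and bandwidth for each group. It is not the paper's tool: the `IORecord` layout, the `describe` and `roofline_bound` helpers, and all sample values are assumptions made up for this example. The bound shown is the classic Roofline form (the minimum of a peak rate and intensity times bandwidth), and the paper's I/O Roofline model may define its terms differently.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class IORecord:
    """One traced I/O operation (hypothetical record layout)."""
    task: str
    file: str
    io_type: str   # "read" or "write"
    nbytes: int    # bytes moved by this operation
    seconds: float  # time spent in this operation


def describe(records, key):
    """Group records by `key` ("task", "file", or "io_type") and compute
    flow size (bytes), operation count, and bandwidth (bytes/second)."""
    totals = defaultdict(lambda: {"flow_size": 0, "op_count": 0, "seconds": 0.0})
    for r in records:
        g = totals[getattr(r, key)]
        g["flow_size"] += r.nbytes
        g["op_count"] += 1
        g["seconds"] += r.seconds
    return {
        name: {
            "flow_size": g["flow_size"],
            "op_count": g["op_count"],
            "bandwidth": g["flow_size"] / g["seconds"] if g["seconds"] > 0 else 0.0,
        }
        for name, g in totals.items()
    }


def roofline_bound(intensity, peak_rate, peak_bandwidth):
    """Classic Roofline bound (assumed form): min(peak, intensity * bandwidth)."""
    return min(peak_rate, intensity * peak_bandwidth)


# Invented sample trace; task and file names are placeholders.
records = [
    IORecord("task_a", "chr1.vcf", "read", 4_000_000, 2.0),
    IORecord("task_a", "chr1.out", "write", 1_000_000, 1.0),
    IORecord("task_b", "chr1.out", "read", 1_000_000, 0.5),
]
by_file = describe(records, "file")
by_type = describe(records, "io_type")
```

Grouping the same trace under different keys is what lets one profile answer separate questions, for example which files are shared between tasks (`by_file`) versus how read and write traffic compare overall (`by_type`).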

Published: May 15, 2025

Citation

Guo L., M. Tang, H. Lee, J.S. Firoz, and N.R. Tallent. 2024. Improving I/O-aware Workflow Scheduling via Data Flow Characterization and trade-off Analysis. In IEEE International Conference on Big Data (BigData 2024), December 15-18, 2024, Washington, D.C., 3674-3681. Piscataway, New Jersey: IEEE. PNNL-SA-205879. doi:10.1109/BigData62323.2024.10825855
