April 25, 2025
Conference Paper
DaYu: Optimizing Distributed Scientific Workflows by Decoding Dataflow Semantics and Dynamics
Abstract
The combination of ever-growing scientific datasets and distributed workflow complexity creates I/O performance bottlenecks due to data volume, velocity, and variety. Although the increasing use of descriptive data formats (e.g., HDF5, netCDF) helps organize these datasets, it also creates obscure bottlenecks due to the need to translate high-level operations into file addresses and then into low-level I/O operations. To address this challenge, we introduce DaYu, a method and toolset for analyzing (a) semantic relationships between logical datasets and file addresses, (b) how dataset operations translate into I/O, and (c) how the two combine across entire workflows. DaYu's analysis and visualization enable identification of critical bottlenecks and reasoning about remediation. We describe our methodology and propose optimization guidelines. Evaluation on scientific workflows demonstrates up to 3.7x performance improvements in I/O time for obscure bottlenecks. The time and storage overhead for DaYu's time-ordered data is typically under 0.2% of runtime and 0.25% of data volume, respectively.
Published: April 25, 2025