Scientific workflows orchestrate the execution of computational methods that have revolutionized the scientific discovery process. For example, simulations have revealed previously unknown catalyst active sites and elucidated reaction mechanisms in materials science. These workflows are at the core of the Department of Energy's strategy for an Integrated Research Infrastructure. Validating computation and theory with experimentation, and vice versa, has proven to be a powerful technique, leading to the discovery of a variety of new materials for catalysis, energy storage, and more.
Automating the complex theory-experiment cycle through scientific workflows can further accelerate this process. Doing so would mean coordinating large computational models with hypothesis generation and instrument control, as well as experimental interpretation and feedback. However, the large amount of information that would need to be passed along in these workflows can cause storage and network bottlenecks.
To alleviate this problem, researchers from Pacific Northwest National Laboratory (PNNL) and the Illinois Institute of Technology developed DataLife—a measurement and analysis toolset for workflows. Their results were presented as a technical paper at SC23, the International Conference for High Performance Computing, Networking, Storage, and Analysis in Denver, Colorado.
“Data transfer is like traffic flow,” said Nathan Tallent, PNNL computer scientist and 2021 Early Career Research Program award recipient. “Too much data through a network or storage system at once can cause delays in a workflow.”
Tallent and his team—including PNNL researchers Hyungro Lee, Luanzheng Guo, and Jesun Firoz—created DataLife to analyze a workflow's data flow lifecycle and to identify and visualize the bottlenecks within it. This allows researchers to find and rank opportunities for improving workflow processes, such as data placement and resource assignment. To evaluate the tool, they applied DataLife to three well-known workflows—1,000 Genomes, DeepDriveMD, and Belle II Monte Carlo—and observed response-time improvements of 15×, 1.9×, and 10–30×, respectively.
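The core idea of measuring per-stage data flow and ranking bottlenecks can be illustrated with a minimal sketch. This is a hypothetical example for intuition only, not DataLife's actual API or measurements: the stage names, byte counts, and timings are invented, and real tools instrument I/O at a much finer grain.

```python
# Hypothetical sketch: rank workflow stages by time spent moving data.
# All stage names and numbers below are illustrative, not from DataLife.
from dataclasses import dataclass


@dataclass
class StageFlow:
    name: str          # workflow stage that produced or consumed data
    bytes_moved: int   # volume of data transferred in this stage
    seconds: float     # wall-clock time spent on the transfer


def rank_bottlenecks(flows):
    """Sort stages by transfer time, slowest first, and report
    each stage's effective throughput in MB/s."""
    ranked = sorted(flows, key=lambda f: f.seconds, reverse=True)
    return [(f.name, f.seconds, f.bytes_moved / f.seconds / 1e6)
            for f in ranked]


# Invented measurements for three stages of a genomics-style pipeline.
flows = [
    StageFlow("align", 5_000_000_000, 120.0),
    StageFlow("merge", 800_000_000, 300.0),
    StageFlow("stats", 100_000_000, 15.0),
]

for name, secs, mbps in rank_bottlenecks(flows):
    print(f"{name}: {secs:.0f} s at {mbps:.1f} MB/s")
```

In this toy ranking, the "merge" stage tops the list despite moving the least data of the two large stages, because its throughput is poor; that is the kind of placement or resource-assignment opportunity a lifecycle analysis surfaces.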
“Through DataLife, we could identify and subsequently address data flow issues in our workflows,” said Tallent. “This type of insight will become more crucial as similar workflows start to appear in experimental science, such as with autonomous instrumentation.”
This work was supported by the Department of Energy's Advanced Scientific Computing Research program through the Orchestration for Distributed and Data-Intensive Scientific Exploration project, and by the PNNL Laboratory Directed Research and Development initiative Cloud, HPC, and Edge for Science and Security. Meng Tang, Anthony Kougkas, and Xian-He Sun of the Illinois Institute of Technology also co-authored the SC23 paper.