May 16, 2025
Conference Paper

Investigating Resilience of Loops in HPC Programs: A Semantic Approach with LLMs

Abstract

Soft errors have become a major concern for the resilience of HPC applications because they can cause serious outcomes such as silent data corruptions (SDCs). Protecting applications from soft errors is an essential yet challenging task. Among the available approaches, a thorough understanding of how prone an application is to errors is key to devising efficient error detection and recovery strategies. Given the scale of HPC applications in both code size and execution time, error propagation analysis often produces a massive volume of unstructured data that takes significant effort to process and to turn into actionable guidance for error protection. In this paper, we present a control-flow-based visual analysis framework that helps users conduct error propagation analysis and identify the critical sections of a program that are more likely to lead to erroneous outcomes when affected by control-flow-related errors. We also design and implement a scalable visualization framework, ResilienceVis, that efficiently and effectively visualizes the affected program states under errors and the propagation traces of an application in a user-friendly manner. Finally, we combine the analysis and visualization to exhibit the error-proneness of different sections of applications.

Published: May 16, 2025

Citation

Jiang H., J. Zhu, B. Fang, C. Chen, and Q. Guan. 2024. Investigating Resilience of Loops in HPC Programs: A Semantic Approach with LLMs. In IEEE High Performance Extreme Computing (HPEC 2024), September 23-27, 2024, Wakefield, MA, 1-10. Piscataway, New Jersey: IEEE. PNNL-SA-162478. doi:10.1109/HPEC62836.2024.10938472