Towards Precision-Aware Fault Tolerance Approaches for Mixed-Precision Applications

March 29, 2023

Conference Paper

Abstract

Graphics Processing Units (GPUs), the dominant accelerators in HPC systems, are susceptible to transient hardware faults. New generations of GPUs feature mixed-precision architectures, such as NVIDIA Tensor Cores, to accelerate matrix multiplications. Although these architectures are widely adopted, how they behave under transient hardware faults remains unclear. In this study, we conduct large-scale fault injection experiments on GEMM kernels implemented with different floating-point data types on the V100 and A100 Tensor Cores, and show that GEMMs with different formats have distinct error resilience characteristics. In the future, we plan to explore this space by building precision-aware floating-point fault tolerance techniques for applications, such as DNNs, that exercise low-precision computations.
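As a rough illustration of why a fault's effect can depend on the data format, the sketch below flips a single bit in the encoding of one input value, stored either as IEEE 754 half precision (FP16) or single precision (FP32), and compares the resulting dot product with the fault-free result. This is a software-only toy and not the injection methodology of the study, which targets GEMM kernels on GPU hardware; the helper names `flip_bit` and `dot`, the input vectors, and the chosen bit positions are assumptions made for this example.

```python
import struct

# (float format code, same-width unsigned integer code, bit width)
FORMATS = {"fp16": ("<e", "<H", 16), "fp32": ("<f", "<I", 32)}


def flip_bit(value, bit, fmt):
    """Return `value` after flipping one bit of its encoding in `fmt`."""
    float_code, int_code, width = FORMATS[fmt]
    if not 0 <= bit < width:
        raise ValueError("bit index out of range for this format")
    # Reinterpret the float's bytes as an unsigned integer of the same width.
    (raw,) = struct.unpack(int_code, struct.pack(float_code, value))
    raw ^= 1 << bit  # emulate a single-bit transient fault
    (corrupted,) = struct.unpack(float_code, struct.pack(int_code, raw))
    return corrupted


def dot(a, b):
    """Plain dot product, standing in for one output element of a GEMM."""
    return sum(x * y for x, y in zip(a, b))


a = [1.0, 2.0, 3.0]
b = [0.5, 0.25, 0.125]
golden = dot(a, b)  # fault-free reference result

for fmt, (_, _, width) in FORMATS.items():
    # Lowest mantissa bit and the sign bit of the first input element.
    for bit in (0, width - 1):
        faulty_a = [flip_bit(a[0], bit, fmt)] + a[1:]
        error = dot(faulty_a, b) - golden
        print(f"{fmt} bit {bit:2d}: output error = {error!r}")
```

Flipping the lowest mantissa bit of 1.0 changes the value by 2^-10 in FP16 but only by 2^-23 in FP32, so the same kind of fault perturbs the output by very different amounts depending on the format, while a sign-bit flip has the same effect in both.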

Citation

Fang B., S. Hari, T. Tsai, X. Li, G. Gopalakrishnan, I. Laguna, and K.J. Barker, et al. 2022. Towards Precision-Aware Fault Tolerance Approaches for Mixed-Precision Applications. In IEEE/ACM 12th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS 2022), November 13-18, 2022, Dallas, TX, 47-52. Piscataway, New Jersey: IEEE. PNNL-SA-177005. doi:10.1109/FTXS56515.2022.00010

Research topics

Resilience and Security

High-Performance Computing
