Modern high-performance and warehouse-scale computing centers show strong interest in minimizing system power consumption while satisfying customers' quality of service (QoS). Dynamic voltage and frequency scaling (DVFS) is effective for achieving this goal. Nevertheless, automating the process online and making it transparent to users requires addressing three major challenges: (1) Complexity: today's hardware components (e.g., CPUs, GPUs, memory, and network) can each be configured in several or dozens of frequency/voltage states to satisfy divergent system demands. Given their combinations and the emergence of heterogeneity, searching the design space online for the optimal configuration can be time-consuming. (2) QoS guarantee: user-defined objectives such as power constraints and performance targets must be monitored, predicted, and ensured on a best-effort basis. (3) Adaptability: various known and unknown workloads run on these systems. Workload characteristics should be determined quickly, and configurations adjusted dynamically in accordance with the workloads and QoS. In this work, we focus on applications that are iterative or periodic, a feature common among conventional HPC and emerging machine learning workloads. We propose an online dynamic power-performance (ODPP) management framework that dynamically adjusts GPU DVFS configurations to meet performance and power objectives and constraints, without any code annotation or intrusion. In particular, ODPP extracts performance and power indicators for applications from their resource utilization profiles over a short episode. It then automatically constructs an accurate model that infers from the indicators how an application's performance and power vary with GPU core and memory frequencies. Aided by the model, ODPP can quickly determine the most appropriate DVFS configuration for both seen and unseen applications.
We evaluate ODPP on an NVIDIA GPU using multiple Exascale Computing Project (ECP) and deep learning applications.
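The model-guided selection step described above can be illustrated with a minimal sketch. It is not the ODPP implementation: the frequency levels, the `compute_frac` indicator, the two toy prediction functions, and the `max_slowdown` constraint are all hypothetical stand-ins chosen for illustration. The sketch only shows the general idea of searching the core/memory frequency space for the lowest predicted power that still meets a performance target.

```python
from itertools import product

# Hypothetical frequency levels in MHz; a real GPU exposes its own supported set.
CORE_FREQS = [600, 900, 1200, 1500]
MEM_FREQS = [800, 2500, 5000]


def predict_time(core_mhz, mem_mhz, compute_frac):
    """Toy model (assumption): iteration time normalized to the top frequencies.

    compute_frac is a hypothetical indicator in [0, 1] giving the share of
    time that scales with core frequency; the rest scales with memory frequency.
    """
    core_term = compute_frac * max(CORE_FREQS) / core_mhz
    mem_term = (1.0 - compute_frac) * max(MEM_FREQS) / mem_mhz
    return core_term + mem_term


def predict_power(core_mhz, mem_mhz):
    """Toy model (assumption): power in watts, static part plus linear terms."""
    return 40.0 + 0.08 * core_mhz + 0.01 * mem_mhz


def select_config(compute_frac, max_slowdown):
    """Return the lowest-power (core, mem) pair within the allowed slowdown.

    Falls back to the highest frequencies when no configuration is feasible.
    """
    feasible = [
        (predict_power(c, m), c, m)
        for c, m in product(CORE_FREQS, MEM_FREQS)
        if predict_time(c, m, compute_frac) <= 1.0 + max_slowdown
    ]
    if not feasible:
        return max(CORE_FREQS), max(MEM_FREQS)
    _, core, mem = min(feasible)
    return core, mem
```

Under these toy models, a fully compute-bound workload with no slowdown allowed keeps the top core frequency but drops memory to its lowest level, while a fully memory-bound workload does the opposite. In practice the two prediction functions would be replaced by a model learned from measured utilization profiles.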
Revised: September 22, 2020
Published: May 11, 2020
Citation
Zou P., A. Li, K.J. Barker, and R. Ge. 2020. Indicator-directed Dynamic Power Management for Iterative Workloads on GPU-Accelerated Systems. In The 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid 2020), May 11-14, 2020, Melbourne, Australia, 559-568. Piscataway, New Jersey: IEEE. PNNL-SA-148280. doi:10.1109/CCGrid49817.2020.00-37