Thrust 1: Optimize High-Performance Computing Tools

Co-leads: James Ang and P. Saday Sadayappan

Goal: Develop tools that optimize communication modes for parallel codes and make efficient use of Compute Express Link (CXL) technologies in cloud systems.

High-performance computing (HPC) systems at leadership computing facilities and in the cloud are extremely complex, with tradeoffs between memory and computing power. While new cloud infrastructures can now provide environments similar to those found in leadership computing facilities, their optimization and tuning are notably different. The cloud offers additional flexibility in hardware configuration across computing stages, which presents an opportunity for increased efficiency but also makes optimization more challenging. This thrust aims to develop optimizations that will enable accurate, cost-effective computational chemistry workflows suited to the HPC cloud environment.

1.1 Scheduling Optimizations for Electronic Structure Methods in HPC Cloud

While standard electronic structure methods perform well for medium-sized chemical systems on the cloud, they have not been extensively studied for large systems. As the size of the system increases, communication patterns become increasingly complex and require tailored optimization. The team will explore various virtual machine and network configurations provided by Azure and develop optimizations to achieve efficient communication patterns for several important electronic structure methods. They will design scheduling algorithms to select the type of virtual machine used in different parts of the computational chemistry workflow.
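To make the scheduling idea concrete, the following is a minimal sketch of per-stage virtual machine selection. It is not the team's algorithm: the `VMType` and `Stage` classes, the stage names, and all memory, speed, and price figures are hypothetical placeholders, and the greedy rule (cheapest VM that satisfies a stage's memory requirement) ignores communication and data-movement costs between stages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VMType:
    name: str
    memory_gb: float
    relative_speed: float  # throughput relative to a baseline VM (hypothetical)
    price_per_hour: float  # hypothetical price, arbitrary currency


@dataclass(frozen=True)
class Stage:
    name: str
    min_memory_gb: float   # memory the stage needs to run at all
    baseline_hours: float  # runtime on the baseline VM


def stage_cost(stage: Stage, vm: VMType) -> float:
    """Estimated cost of running one stage on one VM type."""
    return stage.baseline_hours / vm.relative_speed * vm.price_per_hour


def pick_vm_per_stage(stages, vm_types):
    """Greedy choice: for each stage, the cheapest VM type with enough memory."""
    plan = {}
    for stage in stages:
        feasible = [vm for vm in vm_types if vm.memory_gb >= stage.min_memory_gb]
        if not feasible:
            raise ValueError(f"no VM type has enough memory for {stage.name}")
        plan[stage.name] = min(feasible, key=lambda vm: stage_cost(stage, vm))
    return plan


# Hypothetical catalogue and workflow, for illustration only.
VM_TYPES = [
    VMType("small", memory_gb=64, relative_speed=1.0, price_per_hour=1.0),
    VMType("large", memory_gb=512, relative_speed=2.0, price_per_hour=3.0),
]
STAGES = [
    Stage("light_stage", min_memory_gb=32, baseline_hours=10.0),
    Stage("memory_heavy_stage", min_memory_gb=256, baseline_hours=10.0),
]

plan = pick_vm_per_stage(STAGES, VM_TYPES)
for stage_name, vm in plan.items():
    print(stage_name, "->", vm.name)
```

In this toy setup the light stage stays on the cheaper VM while the memory-heavy stage is forced onto the larger one. A realistic scheduler would also need to account for network configuration and the cost of moving data between stages.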

1.2 Machine Learning-Based Auto-Tuning for Electronic Structure Application Workflows in HPC Cloud

Current approaches to running electronic structure models on leadership computing facilities involve hand-tuning applications for the available architecture and memory system. However, hand-tuning is impractical in an HPC cloud environment because of the variety of hardware configurations available. TEC⁴ researchers will develop a machine learning-based framework to optimize parameter configurations for different electronic structure workflows. The team will combine established analytical and machine learning models to achieve this goal.
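One way to picture combining an analytical model with a learned one is sketched below. This is an illustration under stated assumptions, not the framework described above: the analytical cost formula, its constants, the measured runtimes, and the use of a one-nearest-neighbor residual as the "learned" part are all hypothetical stand-ins chosen to keep the example small.

```python
def analytical_time(nodes: int, compute: float = 64.0, latency: float = 1.0) -> float:
    """Hypothetical analytical model: compute shrinks with node count,
    communication overhead grows with it."""
    return compute / nodes + latency * nodes


def nearest_residual(nodes: int, measured: dict) -> float:
    """Learned correction: residual (measured - analytical) of the closest
    measured configuration, i.e. a one-nearest-neighbor regressor."""
    nearest = min(measured, key=lambda m: abs(m - nodes))
    return measured[nearest] - analytical_time(nearest)


def predicted_time(nodes: int, measured: dict) -> float:
    """Hybrid prediction: analytical estimate plus learned residual."""
    return analytical_time(nodes) + nearest_residual(nodes, measured)


def tune(candidates, measured):
    """Pick the candidate node count with the lowest hybrid prediction."""
    return min(candidates, key=lambda n: predicted_time(n, measured))


# Hypothetical measured runtimes (node count -> time), for illustration only.
MEASURED = {4: 21.0, 8: 22.0, 16: 23.0}
CANDIDATES = [2, 4, 8, 16, 32]

best_analytical = min(CANDIDATES, key=analytical_time)
best_hybrid = tune(CANDIDATES, MEASURED)
print("analytical choice:", best_analytical)
print("hybrid choice:", best_hybrid)
```

With these made-up numbers the analytical model alone prefers 8 nodes, while the measured residuals shift the hybrid choice to 4 nodes. The point is only that the learned correction can change the selected configuration where the analytical model misses an effect.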

1.3 Advanced Memory Technologies to Enable Cloud-Based Scientific Computing

Scientific computing places heavy demands on memory, networking, storage, and computing power. Specialized hardware can meet some of these demands, but it can also affect how memory is allocated and therefore overall performance. Compute Express Link is an emerging technology that can help address these shortcomings through disaggregated memory. TEC⁴ is leading the way in applying this technology to scientific computing and in understanding how it may interact with chemistry codes and machine learning algorithms.