Scale Out Thrust
The rise of machine learning (ML) and artificial intelligence (AI) promises the automation of specific tasks and productivity gains that will rival other technological revolutions. However, because of their ever-growing demands on computational performance, memory/storage capacity, and energy consumption, ML/AI workloads can be classified as high-performance workflows. As part of this family, they face challenges similar to those of sibling workflows, e.g., better utilization of resources and concurrency, and network contention at specific all-to-all communication points, while introducing more nuanced challenges of their own, such as vast sets of weights and biases that can overwhelm a single computing device and a long, usually expensive, investment when training. Moreover, extremely heterogeneous computational substrates (i.e., computing nodes composed of several CPUs plus different accelerators, all communicating through intra- or inter-node network fabrics) make the efficient porting of these workflows even more difficult.
The concept of disaggregated memory, when applied to high-performance workflows, can bridge the gap between 'regular' and 'expert' programmers. It provides a highly productive abstraction of the distributed memory system, offering a potential solution to the challenges these workflows pose. However, fully exploiting the performance opportunities it presents requires a collaborative effort between users and the system software. This thrust is dedicated to addressing the challenges that arise from the interaction of these concurrent actors at scale. Our focus is on the connections to higher-level scientific programming languages (in our case, OpenMP) and on the middle (meso) layers of the LLVM infrastructure, targeting asynchronous many-task runtime systems with sophisticated distributed memory models.
The compiler and runtime are integral to addressing the challenges of high-performance workflows. From the compiler's perspective, we aim to establish a connection to higher-level scientific languages, such as OpenMP, and to evolve classical compiler analyses to introduce concepts such as remote, local, and fast memory under an abstract machine view. These optimizations are guided by the capabilities of the underlying runtime system and its memory model. On the runtime front, the disaggregated memory abstraction necessitates a redesign of some essential features in the underlying layer to better exploit the different memory levels available under this paradigm. This design includes determining the most effective channels of communication between memory regions, effective collective operations, and synchronization at scale. These design decisions evolve in a lockstep, iterative fashion between the two components, with information from the other thrusts (especially the scale-up thrust).
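The interplay between compiler-derived placement hints and runtime memory tiers can be illustrated with a toy model. This is only a sketch: the tier names, capacities, and latency figures below are illustrative assumptions, not measurements of any real system, and a real runtime would also weigh communication channels and data movement costs.

```python
# Toy tiered-memory placement model: greedily place the hottest
# buffers into the fastest tier that still has room.
# Tiers: (name, capacity in GiB, relative latency) -- illustrative only.
TIERS = [("fast", 4, 1), ("local", 16, 10), ("remote", 1 << 20, 100)]

def place(buffers):
    """Greedy placement policy.

    buffers: list of (name, size_gib, access_count).
    Returns {buffer_name: tier_name}, hottest buffers first into
    the fastest tier with enough free capacity."""
    free = {name: cap for name, cap, _ in TIERS}
    placement = {}
    for name, size, _ in sorted(buffers, key=lambda b: -b[2]):
        for tier, _cap, _lat in TIERS:  # tiers ordered fastest first
            if free[tier] >= size:
                free[tier] -= size
                placement[name] = tier
                break
    return placement

demo = place([("weights", 2, 100), ("activations", 4, 50), ("dataset", 8, 10)])
```

In this sketch, the 2 GiB "weights" buffer (accessed most often) lands in the fast tier; the 4 GiB "activations" buffer no longer fits there and spills to local memory, mirroring the kind of decision the compiler and runtime must negotiate.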
As part of this iterative process, we have selected mini workflows that will inform our design decisions. The workflows of interest were selected based on their memory use (either "interesting" access patterns or large capacity requirements), their concurrency characteristics, and their relative simplicity. The first mini workflow is based on gradient-boosted decision trees / random forests, in which auxiliary structures (such as histograms) can devote more memory to observations, creating a trade-off between accuracy and memory capacity. Moreover, because the data structure depends on the entropy of the input data, access patterns between tree nodes can result in remote operations.
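The accuracy/capacity trade-off can be sketched as follows (illustrative only; production GBDT frameworks use more elaborate binning): a per-feature gradient histogram needs O(n_bins) memory per tree node, so coarser bins shrink the working set at the cost of split resolution.

```python
def build_histogram(values, gradients, n_bins):
    """Bin feature values and accumulate per-bin gradient sums.

    Memory cost is O(n_bins) per feature per tree node, so fewer
    bins trade split accuracy for memory capacity."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against constant features
    grad_sum = [0.0] * n_bins
    counts = [0] * n_bins
    for v, g in zip(values, gradients):
        b = min(int((v - lo) / width), n_bins - 1)  # clamp max value into last bin
        grad_sum[b] += g
        counts[b] += 1
    return grad_sum, counts

values = [0.1, 0.4, 0.5, 0.9, 1.0]
grads = [1.0, -0.5, 0.25, 2.0, -1.0]
gs, cnt = build_histogram(values, grads, n_bins=4)
```

Split finding then scans the bins instead of the raw observations; halving `n_bins` halves the histogram footprint but merges nearby candidate split points.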
For the second mini workflow, we selected a nascent proxy application derived from pangenomics. The team uses the Giraffe framework as the base application from which to extract the memory and concurrency requirements. The sequence-matching algorithm matches many genetic sequences against a graph Burrows-Wheeler transform (a compressed representation of a genome graph). This stream of sequences creates concurrency while presenting interesting memory patterns, since matching must follow the graph structure.
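The core lookup in such matchers can be sketched with the classical (linear, non-graph) FM-index backward search. This simplification ignores the graph structure that Giraffe actually traverses, but it shows why the access pattern is driven by the index layout rather than by the query text:

```python
def bwt(text):
    """Burrows-Wheeler transform of text (appending a '$' terminator)."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)

def count_matches(bwt_str, pattern):
    """FM-index backward search: count occurrences of pattern in the
    original text, processing the pattern right to left."""
    first_occ = {}  # first row in the sorted column for each character
    for i, c in enumerate(sorted(bwt_str)):
        first_occ.setdefault(c, i)

    def rank(c, i):  # occurrences of c in bwt_str[:i]; real indexes use rank tables
        return bwt_str[:i].count(c)

    lo, hi = 0, len(bwt_str)
    for c in reversed(pattern):
        if c not in first_occ:
            return 0
        lo = first_occ[c] + rank(c, lo)
        hi = first_occ[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

b = bwt("banana")
```

Each step of the loop jumps to a data-dependent region of the index, which is what makes the memory behavior of this workload interesting under a disaggregated memory model.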
This thrust has the following sub-thrusts associated with it. We are developing a characterization of both mini workflows in terms of memory behavior and how concurrency affects them. We are producing a prototype implementation of the compiler infrastructure that uses the distributed runtime as its target, generated from restricted OpenMP code. We are redesigning the runtime in terms of disaggregated memory, focusing on synchronization at scale. Finally, we are investigating the integration of memory-centric optimizations in the compiler layers while targeting a distributed memory model.
Publications
Jessica Imlau Dagostini, Scott Beamer, Tyler Sorensen (University of California, Santa Cruz), Joseph Manzano, and Andres Marquez (Pacific Northwest National Laboratory). "Developing a Proxy Application Based on a Parallel Pangenome Mapping Tool." Extended abstract. To appear in BioSys 2024 (in conjunction with ASPLOS 2024), April 27, 2024.