August 1, 2024
Conference Paper
AXI4MLIR: User-Driven Automatic Host Code Generation for Custom AXI-Based Accelerators
Abstract
Tensor algebra operations represent an important class of algorithms used across many applications, including machine learning, scientific computing, and data analytics. As a result, the efficient generation of custom accelerators for tensor operations has received increased attention. Previous efforts have produced automated tools enabling users to prototype and explore optimized accelerators. However, these tools have devoted little attention to host-accelerator interaction. Efficient use of hardware accelerators requires knowledge of the accelerator's capabilities (operations, data formats, and opcode support), the host CPU microarchitecture (e.g., memory hierarchy), the host-accelerator interface, and the application's features (which code regions should be mapped onto an accelerator). Manually rewriting the original application to improve its mapping onto a custom accelerator is an error-prone and time-consuming endeavor. To address this, we propose AXI4MLIR, a new framework that automatically generates and optimizes the communication between the host CPU and arbitrary accelerators implementing linear algebra algorithms. AXI4MLIR extends the MLIR compiler framework to automatically generate efficient host-accelerator driver code for accelerators with AXI-based interfaces. Our compiler extensions enable automatic driver code generation while carefully considering the host's memory hierarchy and the target accelerator's features. To demonstrate the flexibility and utility of AXI4MLIR, we test it with diverse use cases that include different types of accelerators, tiling scenarios, and dataflow schemes. We compare our experimental results to manual implementations of host-accelerator driver code and find that our approach can reduce CPU cache references by 56% and deliver up to a 1.65x speedup.
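To make the host-side burden concrete, the sketch below shows the kind of hand-written, memory-mapped driver code that frameworks like AXI4MLIR aim to generate automatically. It is a minimal illustration only: the base address, register offsets, opcode, and bit masks are hypothetical placeholders, not the interface of any accelerator evaluated in the paper, and the tiled data-movement loops are elided.

```c
/* A minimal, hypothetical sketch of hand-written host driver code for an
 * AXI-Lite-controlled accelerator. All addresses, register offsets, and
 * the opcode below are illustrative assumptions, not the interface of any
 * accelerator from the paper. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define ACCEL_BASE   0x40000000UL  /* hypothetical AXI-Lite base address */
#define MAP_SIZE     0x1000UL
#define REG_CTRL     0x00          /* hypothetical control register      */
#define REG_STATUS   0x04          /* hypothetical status register       */
#define REG_OPCODE   0x08          /* hypothetical opcode register       */
#define CTRL_START   0x1
#define STATUS_DONE  0x2
#define OP_MATMUL    0x3           /* hypothetical matmul opcode         */

int main(void) {
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    /* Map the accelerator's control registers into the host address space. */
    volatile uint32_t *regs = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, ACCEL_BASE);
    if (regs == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* Select the operation, start the accelerator, then busy-wait.
     * The per-tile data transfers (e.g., over an AXI-Stream DMA) that
     * would surround these register accesses are omitted; sizing those
     * tiles to the host cache hierarchy is exactly the bookkeeping
     * AXI4MLIR automates. */
    regs[REG_OPCODE / 4] = OP_MATMUL;
    regs[REG_CTRL / 4]   = CTRL_START;
    while (!(regs[REG_STATUS / 4] & STATUS_DONE))
        ; /* poll until the accelerator signals completion */

    munmap((void *)regs, MAP_SIZE);
    close(fd);
    return 0;
}
```

Writing and tuning this boilerplate by hand for every accelerator, opcode, and tiling scheme is the error-prone effort the abstract describes; generating it from the compiler, with the host memory hierarchy in view, is where the reported cache-reference and speedup gains come from.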