November 18, 2024
Conference Paper
SmartFuse: Reconfigurable Smart Switches to Accelerate Fused Collectives in HPC Applications
Abstract
Communication switches have sometimes been augmented to process collectives (e.g., the IBM BlueGene project and the Mellanox SHArP switch). In this work, we find a substantial acceleration opportunity in further augmenting switches to handle more complex functions that combine communication with computation. We consider three types of such functions. The first is fully-fused collectives, built by fusing multiple existing collectives, such as Allreduce with Alltoall. The second is semi-fused collectives, built by combining a collective with another computation. The third, which we refer to as higher-order collectives, is built by combining multiple computations and communications, for example to perform parallel matrix-matrix multiply (PGEMM). We propose a framework called SmartFuse to accelerate these fused collective functions. The core of SmartFuse is a reconfigurable smart switch that supports these operations. The semi-fused and fully-fused collectives are implemented with a CGRA-like architecture, while higher-order collectives are implemented with a more specialized computational unit that can also schedule communication. Supporting the framework is software that evaluates and translates the relevant parts of the input program, compiles them into a control data flow graph, and then maps this graph to the switch hardware. Once deployed, the proposed framework has strong potential to accelerate existing HPC applications transparently by encapsulation within an MPI implementation. Experimental results show that this approach improves the performance of the PGEMM kernel, MINIFE, and AMG by, on average, 94%, 15%, and 13%, respectively.
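To make the semi-fused case concrete, the following Python sketch simulates a handful of ranks as plain lists inside one process. It is only an illustration under stated assumptions: it is not the SmartFuse implementation, it does not use MPI, and the function names (allreduce_sum, unfused_norm, semi_fused_allreduce) are hypothetical. It compares an Allreduce followed by a local computation on every rank with a variant in which the post-reduction computation is applied once at the reduction point, as a smart switch might do.

```python
import math
from typing import Callable, List

Vector = List[float]


def allreduce_sum(per_rank: List[Vector]) -> List[Vector]:
    """Simulated Allreduce: every rank receives the element-wise sum."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def unfused_norm(per_rank: List[Vector]) -> List[Vector]:
    """Baseline: Allreduce, then each rank applies sqrt to its own copy."""
    reduced = allreduce_sum(per_rank)
    return [[math.sqrt(x) for x in vec] for vec in reduced]


def semi_fused_allreduce(
    per_rank: List[Vector], post: Callable[[float], float]
) -> List[Vector]:
    """Hypothetical semi-fused collective (an assumption for illustration):
    reduce, apply `post` once at the reduction point, then broadcast."""
    total = [post(sum(vals)) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]


# Four simulated ranks, each holding local partial sums of squares.
partials = [[1.0, 4.0], [3.0, 0.0], [5.0, 12.0], [0.0, 0.0]]

baseline = unfused_norm(partials)
fused = semi_fused_allreduce(partials, math.sqrt)

# Both paths deliver the same values to every rank.
print(baseline == fused)
```

In this toy model the fused path evaluates the square root once per element instead of once per element on every rank. Any real benefit depends on the switch hardware and the network, neither of which this sketch models.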