Although Transformer-based language representations achieve state-of-the-art accuracy on various natural language processing (NLP) tasks, the large model size poses a challenge for resource-constrained computing platforms. Weight pruning, a popular and effective technique for reducing the number of weight parameters and accelerating the Transformer, has been investigated on GPUs. However, Transformer acceleration using weight pruning on field-programmable gate arrays (FPGAs) remains unexplored. This paper investigates column balanced block-wise pruning on the Transformer and designs an FPGA acceleration engine customized for balanced block-wise matrix multiplication. We implement the Transformer model with proper hardware scheduling, and the experiments show that Transformer inference on the FPGA achieves 10.35 ms latency with a batch size of 32, a 10.96× speedup over the CPU platform and a 2.08× speedup over the GPU platform.
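To illustrate the pruning scheme named above, here is a minimal NumPy sketch of column balanced block-wise pruning: the weight matrix is tiled into fixed-size blocks, and every block-column retains the same number of blocks, which keeps the non-zero workload balanced across columns for hardware scheduling. Selecting blocks by L2-norm magnitude is an assumption, and the function name and the `block_size` and `keep_ratio` parameters are illustrative, not taken from the paper.

```python
import numpy as np

def column_balanced_block_prune(W, block_size=16, keep_ratio=0.25):
    """Hypothetical sketch of column balanced block-wise pruning.

    Tiles W into block_size x block_size blocks and, within each
    block-column, keeps only the top blocks by L2 norm (assumed
    criterion), zeroing the rest. Every block-column keeps the same
    number of blocks, so the sparse workload is balanced.
    """
    rows, cols = W.shape
    assert rows % block_size == 0 and cols % block_size == 0
    n_block_rows = rows // block_size
    n_block_cols = cols // block_size
    keep = max(1, int(round(keep_ratio * n_block_rows)))

    mask = np.zeros_like(W, dtype=bool)
    for bc in range(n_block_cols):
        # Score each block in this block-column by its L2 norm.
        norms = np.empty(n_block_rows)
        for br in range(n_block_rows):
            blk = W[br * block_size:(br + 1) * block_size,
                    bc * block_size:(bc + 1) * block_size]
            norms[br] = np.linalg.norm(blk)
        # Keep the `keep` highest-norm blocks; prune the others.
        for br in np.argsort(norms)[-keep:]:
            mask[br * block_size:(br + 1) * block_size,
                 bc * block_size:(bc + 1) * block_size] = True
    return W * mask, mask
```

Because each block-column ends up with an identical number of surviving blocks, a hardware engine can assign one block-column per compute lane and every lane performs the same amount of work, which is the property the balanced block-wise matrix multiplication engine exploits.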
Published: August 1, 2021
Citation
Peng, H., S. Huang, T. Geng, A. Li, W. Jiang, H. Liu, S. Wang, et al. 2021. "Accelerating Transformer-based Deep Learning Models on FPGAs using Column Balanced Block Pruning." In Proceedings of the 22nd International Symposium on Quality Electronic Design (ISQED 2021), April 7-9, 2021, Santa Clara, CA, 142-148. Piscataway, New Jersey: IEEE. PNNL-SA-159983. doi:10.1109/ISQED51717.2021.9424344