Publications

Note: Some links are only provided through paid subscription services, which may limit access. Consult your institution’s library for assistance in obtaining these documents.

2021

Li A and S Su. 2021. "Accelerating Binarized Neural Networks via Bit-Tensor-Cores in Turing GPUs." IEEE Transactions on Parallel and Distributed Systems 32(7):1878–1891. PNNL-SA-156570. DOI: 10.1109/TPDS.2020.3045828.

Geng T, A Li, T Wang, C Wu, Y Li, R Shi, and W Wu, et al. 2021. "O3BNN-R: An Out-Of-Order Architecture for High-Performance and Regularized BNN Inference." IEEE Transactions on Parallel and Distributed Systems 32(1):199–213. PNNL-SA-148318. DOI: 10.1109/TPDS.2020.3013637.

2020

Zou P, A Li, KJ Barker, and R Ge. 2020. "Indicator-directed Dynamic Power Management for Iterative Workloads on GPU-Accelerated Systems." In The 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid 2020), May 11–14, 2020, Melbourne, Australia, 559–568. Piscataway, New Jersey: IEEE. PNNL-SA-148280. DOI: 10.1109/CCGrid49817.2020.00-37.

Zou P, A Li, KJ Barker, and R Ge. 2020. "Detecting Anomalous Computation with RNNs on GPU-Accelerated HPC Machines." In Proceedings of the 49th International Conference on Parallel Processing (ICPP 2020), August 17–20, 2020, Online, Article No. 3404435. New York, New York: Association for Computing Machinery. PNNL-SA-148325. DOI: 10.1145/3404397.3404435.

Zhang X, S Song, C Xie, X Fu, J Wang, and W Zhang. 2020. "Enabling Highly Efficient Capsule Networks Processing Through A PIM-Based Architecture Design." In Proceedings of the 26th IEEE International Symposium on High-Performance Computer Architecture (HPCA), February 22–26, 2020, San Diego, California, 542–555. Los Alamitos, California: IEEE Computer Society. PNNL-SA-149995. DOI: 10.1109/HPCA47549.2020.00051.

Wu X, Y Yi, D Tian, and J Li. 2020. "Generic, Sparse Tensor Core for Neural Networks." In 1st International Workshop on Machine Learning for Software Hardware Co-Design. PNNL-SA-156246.

Wang T, T Geng, A Li, X Jin, and M Herbordt. 2020. "FPDeep: Scalable Acceleration of CNN Training on Deeply-Pipelined FPGA Clusters." IEEE Transactions on Computers 69(8):1143–1158. PNNL-SA-148553. DOI: 10.1109/TC.2020.3000118.

Shi R, P Dong, T Geng, Y Ding, X Ma, H So, and M Herbordt, et al. 2020. "CSB-RNN: A Faster-Than-Realtime RNN Acceleration Framework with Compressed Structured Blocks." In ICS '20: Proceedings of the 34th ACM International Conference on Supercomputing, June 29–July 2, 2020, Barcelona, Spain, Article No. 24. New York, New York: Association for Computing Machinery. PNNL-SA-150973. DOI: 10.1145/3392717.3392749.

Nickless WK. 2020. User Access to Scientific Facilities via 5G: A Cyber Security Thought Experiment. PNNL-2963, Pacific Northwest National Laboratory, Richland, Washington.

Li J, M Lakshminarasimhan, X Wu, A Li, C Olschanowsky, and KJ Barker. 2020. "A Sparse Tensor Benchmark Suite for CPUs and GPUs." In 2020 IEEE International Symposium on Workload Characterization (IISWC), October 27–30, 2020, Beijing, China, 193–204. Piscataway, New Jersey: IEEE. PNNL-SA-142736. DOI: 10.1109/IISWC50251.2020.00027.

Li J, M Lakshminarasimhan, X Wu, A Li, C Olschanowsky, and KJ Barker. 2020. "A Parallel Sparse Tensor Benchmark Suite on CPUs and GPUs." In arXiv. PNNL-SA-146278.

Li A, S Song, J Chen, J Li, X Liu, NR Tallent, and KJ Barker. 2020. "Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch, and GPUDirect." IEEE Transactions on Parallel and Distributed Systems 31(1):94–110. PNNL-SA-141707. DOI: 10.1109/TPDS.2019.2928289.

Hein E, S Eswar, A Yasar, J Li, JS Young, T Conte, and U Catalyurek, et al. 2020. "Programming Strategies for Irregular Algorithms on the Emu Chick." ACM Transactions on Parallel Computing 7(4): Article No. 25. PNNL-SA-144460. DOI: 10.1145/3418077.

Geng T, C Wu, C Tan, B Fang, A Li, and M Herbordt. 2020. "CQNN: a CGRA-based QNN Framework." In IEEE High Performance Extreme Computing Conference (HPEC 2020), September 22–24, 2020, Waltham, Massachusetts, 1–7. Piscataway, New Jersey: IEEE. PNNL-SA-153940. DOI: 10.1109/HPEC43674.2020.9286194.

Geng T, A Li, R Shi, C Wu, T Wang, Y Li, and P Haghi, et al. 2020. "AWB-GCN: A Graph Convolutional Network Accelerator with Runtime Workload Rebalancing." In Proceedings of the 53rd IEEE/ACM International Symposium on Microarchitecture (MICRO), October 17–21, 2020, Athens, Greece, 922–936. Piscataway, New Jersey: IEEE. PNNL-SA-146537. DOI: 10.1109/MICRO50266.2020.00079.

Gawande NA, JA Daily, C Siegel, NR Tallent, and A Vishnu. 2020. "Scaling Deep Learning Workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing." Future Generation Computer Systems 108. PNNL-SA-134513. DOI: 10.1016/j.future.2018.04.073.

Firoz JS, A Li, J Li, and KJ Barker. 2020. "On the Feasibility of Using Reduced-Precision Tensor Core Operations for Graph Analytics." In IEEE High Performance Extreme Computing Conference (HPEC 2020), September 22–24, 2020, Waltham, MA, 1–7. Piscataway, New Jersey: IEEE. PNNL-SA-153853. DOI: 10.1109/HPEC43674.2020.9286152.

Carroll TE. 2020. Uncontrolled Read/Write Access to GPU Global Memory via cudaMemHandle Forging. PNNL-29696, Pacific Northwest National Laboratory, Richland, Washington.

2019

Zou P, A Li, KJ Barker, and R Ge. 2019. "Fingerprinting Anomalous Computation with RNN for GPU-accelerated HPC Machines." In IEEE International Symposium on Workload Characterization (IISWC 2019), November 3–5, 2019, Orlando, Florida, 253–256. Piscataway, New Jersey: IEEE. PNNL-SA-144356. DOI: 10.1109/IISWC47752.2019.9042165.

Young JS, E Hein, S Eswar, P Lavin, J Li, EJ Riedy, and R Vuduc, et al. 2019. "A Microbenchmark Characterization of the Emu Chick." Parallel Computing 87. PNNL-SA-143941. DOI: 10.1016/j.parco.2019.04.012.

Xie C, X Zhang, A Li, X Fu, and S Song. 2019. "PIM-VR: Erasing Motion Anomalies in Highly-Interactive Virtual Reality World With Customized Memory Cube." In IEEE International Symposium on High Performance Computer Architecture. PNNL-SA-143513. DOI: 10.1109/HPCA.2019.00013.

Xie C, X Fu, and S Song. 2019. "OO-VR: NUMA Friendly Object-Oriented VR Rendering Framework for Future NUMA-Based Multi-GPU Systems." In ISCA '19: Proceedings of the 46th International Symposium on Computer Architecture. PNNL-SA-143989.

Nisa I, J Li, A Sukumaran-Rajan, R Vuduc, and P Sadayappan. 2019. "Load-balanced sparse MTTKRP on GPUs." In 33rd IEEE International Parallel & Distributed Processing Symposium. PNNL-SA-138752. DOI: 10.1109/IPDPS.2019.00023.

Nisa I, J Li, A Sukumaran-Rajan, P Rawat, S Krishnamoorthy, and P Sadayappan. 2019. "An Efficient Mixed-Mode Representation of Sparse Tensors." In SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 17–22, 2019, Denver, Colorado, Article No. a49. Los Alamitos, California: IEEE Computer Society. PNNL-SA-142737. DOI: 10.1145/3295500.3356216.

Meng K, J Li, G Tan, and N Sun. 2019. "A Pattern Based Algorithmic Autotuner for Graph Processing on GPUs." In PPoPP '19: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, February 16–20, 2019, Washington, D.C., 201–213. New York, New York: Association for Computing Machinery. PNNL-SA-140392. DOI: 10.1145/3293883.3295716.

Liu J, D Li, G Kestor, and JS Vetter. 2019. "Runtime Concurrency Control and Operation Scheduling for High Performance Neural Network Training." In IEEE International Parallel and Distributed Processing Symposium (IPDPS 2019), May 20–24, 2019, Rio de Janeiro, Brazil, 188–199. Los Alamitos, California: IEEE Computer Society. PNNL-SA-141283. DOI: 10.1109/IPDPS.2019.00029.

Li J, Y Ma, X Wu, A Li, and KJ Barker. 2019. "PASTA: A Parallel Sparse Tensor Algorithm Benchmark Suite." CCF Transactions on High Performance Computing 1(2):111–130. PNNL-SA-140675. DOI: 10.1007/s42514-019-00012-w.

Li J, B Ucar, U Catalyurek, J Sun, KJ Barker, and R Vuduc. 2019. "Efficient and Effective Sparse Tensor Reordering." In ICS '19: Proceedings of the ACM International Conference on Supercomputing, June 26–28, 2019, Phoenix, Arizona, 227–237. New York, New York: ACM. PNNL-SA-138751. DOI: 10.1145/3330345.3330366.

Li A, T Geng, T Wang, M Herbordt, S Song, and KJ Barker. 2019. "BSTC: A Novel Binarized-Soft-Tensor-Core Design for Accelerating Bit-Based Approximated Neural Nets." In SC '19: International Conference for High Performance Computing, Networking, Storage, and Analysis, November 17–22, 2019, Denver, Colorado, Article No. a38. PNNL-SA-142851. DOI: 10.1145/3295500.3356169.

Castellana VG, M Minutoli, A Tumeo, M Lattuada, P Fezzardi, and F Ferrandi. 2019. "Software Defined Architectures for Data Analytics." In Proceedings of the 24th Asia and South Pacific Design Automation Conference (ASPDAC 2019), January 21–24, 2019, Tokyo, Japan, 711–718. New York, New York: ACM. PNNL-SA-139669. DOI: 10.1145/3287624.3288754.

2018

Wang L, J Ye, Y Zhao, W Wu, A Li, S Song, and Z Xu, et al. 2018. "SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks." In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, (PPOPP 2018), February 24–28, 2018, Vienna, Austria, 41–53. New York, New York: ACM. PNNL-SA-143407. DOI: 10.1145/3200691.3178491.

Shen D, A Li, S Song, and X Liu. 2018. "CUDAAdvisor: LLVM-based Runtime Profiling for Modern GPUs." In CGO 2018: Proceedings of the 2018 International Symposium on Code Generation and Optimization. PNNL-SA-143512. DOI: 10.1145/3168831.

Li A, W Liu, L Wang, KJ Barker, and S Song. 2018. "Warp-Consolidation: A Novel Execution Model for GPUs." In ICS '18: Proceedings of the 2018 International Conference on Supercomputing, June 2018, 53–64. PNNL-SA-133947. DOI: 10.1145/3205289.3205294.

Li A, S Song, J Chen, X Liu, NR Tallent, and KJ Barker. 2018. "Tartan: Evaluating Modern GPU Interconnect via a Multi-GPU Benchmark Suite." In IEEE International Symposium on Workload Characterization (IISWC 2018), September 30–October 2, 2018, 191–202. Piscataway, New Jersey: IEEE. PNNL-SA-137642. DOI: 10.1109/IISWC.2018.8573483.

Barker KJ, NR Tallent, A Marquez, MC Macduff, A Li, RD Friese, and A Tumeo, et al. 2018. CENATE Status Report. PNNL-27651, Pacific Northwest National Laboratory, Richland, Washington.

Tallent NR, NA Gawande, C Siegel, A Vishnu, and A Hoisie. 2018. "Evaluating On-Node GPU Interconnects for Deep Learning Workloads." In High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation. Cham, Switzerland: Springer. PNNL-SA-129849. DOI: 10.1007/978-3-319-72971-8_1.

2017

Zhao W, A Li, Y Wang, and Y Ha. 2017. "Analysis and Design of Energy-Efficient Data-Dependent SRAM." In IEEE 12th International Conference on ASIC. PNNL-SA-143165. DOI: 10.1109/ASICON.2017.8252625.

Xie C, S Song, J Wang, W Zhang, and X Fu. 2017. "Processing-in-Memory Enabled Graphics Processors for 3D Rendering." In IEEE International Symposium on High Performance Computer Architecture (HPCA 2017), February 4–8, 2017, Austin, Texas, 637–648. Los Alamitos, California: IEEE Computer Society. PNNL-SA-122891. DOI: 10.1109/HPCA.2017.37.

Tallent NR, DJ Kerbyson, and A Hoisie. 2017. "Representative Paths Analysis." In SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis. PNNL-SA-127589. DOI: 10.1145/3126908.3126962.

Liu W, A Li, JD Hogg, IS Duff, and B Vinter. 2017. "Fast synchronization-free algorithms for parallel sparse triangular solves with multiple right-hand sides." Concurrency and Computation: Practice and Experience 29(21): Article e4244. PNNL-SA-130501. DOI: 10.1002/cpe.4244.

Li A, W Zhao, and S Song. 2017. "BVF: Enabling Significant On-Chip Power Savings via Bit-Value-Favor for Throughput Processors." In MICRO-50 '17: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. PNNL-SA-130500. DOI: 10.1145/3123939.3123944.

Li A, W Liu, M Kristensen, B Vinter, H Wang, K Hou, and A Marquez, et al. 2017. "Exploring and Analyzing the Real Impact of Modern On-Package Memory on HPC Scientific Kernels." In SC '17: International Conference for High Performance Computing, Networking, Storage and Analysis. PNNL-SA-143163. DOI: 10.1145/3126908.3126931.

Li A, S Song, W Liu, X Liu, A Kumar, and H Corporaal. 2017. "Locality-Aware CTA Clustering For Modern GPUs." In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2017), April 8–12, 2017, Xi'an, China, 297–311. New York, New York: ACM. PNNL-SA-123050. DOI: 10.1145/3037697.3037709.

Gawande NA, JB Landwehr, JA Daily, NR Tallent, A Vishnu, and DJ Kerbyson. 2017. "Scaling deep learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing." In IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW 2017), May 29–June 2, 2017, Lake Buena Vista, Florida, 399–408. Los Alamitos, California: IEEE Computer Society. PNNL-SA-129129. DOI: 10.1109/IPDPSW.2017.36.

Qiu J, Z Zhao, B Wu, A Vishnu, and S Leon Song. 2017. "Enabling scalability-sensitive speculative parallelization for FSM computations." In Proceedings of the International Conference on Supercomputing (ICS 2017), June 14–16, 2017, Chicago, Illinois. PNNL-SA-125769. DOI: 10.1145/3079079.3079082.

Friese RD, NR Tallent, A Vishnu, DJ Kerbyson, and A Hoisie. 2017. "Generating Performance Models for Irregular Applications." In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS 2017), May 29–June 2, 2017, Orlando, Florida. PNNL-SA-123945. DOI: 10.1109/IPDPS.2017.61.

2016

Tallent NR, KJ Barker, R Gioiosa, A Marquez, G Kestor, S Song, and A Tumeo, et al. 2016. "Assessing Advanced Technology in CENATE." In Proceedings of the IEEE International Conference on Networking, Architecture, and Storage (NAS 2016), August 8–10, 2016, Long Beach, California. Piscataway, New Jersey: IEEE. PNNL-SA-119257. DOI: 10.1109/NAS.2016.7549392.

Tan J, S Song, K Yan, X Fu, A Marquez, and DJ Kerbyson. 2016. "Combating the Reliability Challenge of GPU Register File at Low Supply Voltage." In Proceedings of the 25th International Conference on Parallel Architectures and Compilation (PACT '16), September 11–15, 2016, Haifa, Israel, 3–15. New York, New York: ACM. PNNL-SA-119484. DOI: 10.1145/2967938.2967951.

Li A, S Leon Song, A Kumar, EZ Zhang, DG Chavarría-Miranda, and H Corporaal. 2016. "Critical points based register-concurrency autotuning for GPUs." In 2016 Design, Automation and Test in Europe Conference and Exhibition, March 14–18, 2016, Dresden, Germany.

Hayes A, L Li, DG Chavarria, S Song, and E Zhang. 2016. "ORION: A Framework for GPU Occupancy Tuning." In Proceedings of the 17th Middleware Conference (Middleware 2016), December 12–16, 2016, Trento, Italy, 1–13; Article No. 18. New York, New York: ACM. PNNL-SA-120583. DOI: 10.1145/2988336.2988355.

Li A, S Song, M Wijtvliet, A Kumar, and H Corporaal. 2016. "SFU-Driven Transparent Approximation Acceleration on GPUs." In Proceedings of the International Conference on Supercomputing (ICS 2016), June 1–3, 2016, Istanbul, Turkey, Paper No. 15. New York, New York: Association for Computing Machinery. PNNL-SA-117058. DOI: 10.1145/2925426.

Roy P, X Liu, and S Song. 2016. "SMT-Aware Instantaneous Footprint Optimization." In HPDC '16: Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, May 31–June 4, 2016, Kyoto, Japan, 267–279. New York, New York: ACM. PNNL-SA-117062. DOI: 10.1145/2907294.2907308.

Li L, AB Hayes, S Leon Song, and EZ Zhang. 2016. “Tag-Split Cache for Efficient GPGPU Cache Utilization.” In Proceedings of the 2016 International Conference on Supercomputing, New York, New York, USA. PNNL-SA-117315. DOI: 10.1145/2925426.2926253.

Li A, SL Song, E Brugel, A Kumar, D Chavarria, and H Corporaal. 2016. "X: A Comprehensive Analytic Model for Parallel Machines." In 30th International Parallel and Distributed Processing Symposium (IPDPS).

2015

Tan L, Z Chen, and SL Song. 2015. "Scalable Energy Efficiency with Resilience for High Performance Computing Systems: A Quantitative Methodology." ACM Transactions on Architecture and Code Optimization 12:35:1–35:27. PNNL-SA-113322. DOI: 10.1145/2822893.
