
Advanced Computing, Mathematics and Data
Research Highlights

March 2016

A Spotlight on Improving Computing System Performance

Song's research hits a high note with HPDC'16 program committee

Shuaiwen Leon Song, a research scientist with the High Performance Computing group in PNNL’s Advanced Computing, Mathematics, and Data Division, is in an enviable position: he co-authored two full papers accepted for this year’s Association for Computing Machinery 25th International Symposium on High-Performance Parallel and Distributed Computing, known as HPDC’16. According to the HPDC’16 program co-chairs, the competition was especially tough. All papers underwent a two-step, multi-person committee review, and of the 129 originally submitted to the conference, only 20 full papers made the final cut, an acceptance rate of about 15.5 percent.


In summer 2015, Shuaiwen Leon Song (left) served as a co-mentor to Dingwen Tao (right).

A hard line against soft errors

One paper, “New-Sum: A Novel Online ABFT Scheme For General Iterative Methods,” tackles the design of new algorithm-based fault tolerance schemes (“ABFT” in the title) for iterative solvers on large-scale machines. Among the contributors, Dingwen Tao, currently a Ph.D. student at the University of California, Riverside, spent his summer internship at PNNL in 2015 working with Song and Sriram Krishnamoorthy, who served as his mentors. In addition to Tao, Song, and Krishnamoorthy, the work included contributions from Darren Kerbyson, associate division director of PNNL’s HPC group, and additional researchers from UC Riverside and Rutgers University.


Sriram Krishnamoorthy (left) and Darren Kerbyson (right) also were co-authors of the paper addressing novel ABFT schemes that may combat soft errors in systems.

The resulting collaboration yielded a new checksum encoding mechanism that tolerates cache and register bit-flips and does not require additional checksum verifications after every vector-generating operation. Based on the new checksum scheme, they developed the basic and two-level online ABFT algorithms that can detect errors based on system error rates.

ABFT schemes allow computing systems to detect and potentially correct errors at a lower cost than traditional redundant computation, which makes them particularly helpful in dealing with soft errors that may not crash systems but can lead to silent data corruption. As the name implies, silent data corruption allows errors to progress undetected until they potentially snowball into system failure or permanent data loss. “Silent” errors already have occurred in leadership-class supercomputers, a result of the voluminous data moving through these systems in a relatively short time.
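To make the idea of checksum-based detection concrete, here is a minimal sketch of the classic ABFT check for a matrix-vector product, the core operation in many iterative solvers. It is not the New-Sum encoding from the paper, which the article does not describe in detail. The function names, the tolerance, and the `fault` hook used to mimic a bit-flip are all assumptions made for illustration.

```python
# Classic checksum-style ABFT for y = A x (illustrative only; this is NOT
# the New-Sum scheme). All names and the tolerance are hypothetical.

def column_checksums(A):
    """Precompute c = e^T A, i.e. the sum of each column of A."""
    return [sum(col) for col in zip(*A)]

def matvec(A, x):
    """Plain dense matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def checked_matvec(A, c, x, tol=1e-9, fault=None):
    """Compute y = A x and verify it against the checksum vector c.

    `fault` is a hypothetical hook, (index, delta), that perturbs one
    entry of y to mimic a soft error in the computation.
    Returns (y, ok), where ok is False if the checksum test fails.
    """
    y = matvec(A, x)
    if fault is not None:
        i, delta = fault
        y[i] += delta
    # sum(y) should equal c . x, which costs O(n) instead of a second O(n^2) product
    expected = sum(ci * xi for ci, xi in zip(c, x))
    ok = abs(sum(y) - expected) <= tol * max(1.0, abs(expected))
    return y, ok

A = [[4.0, 1.0], [1.0, 3.0]]
c = column_checksums(A)
x = [1.0, 2.0]
y_clean, ok_clean = checked_matvec(A, c, x)
y_bad, ok_bad = checked_matvec(A, c, x, fault=(0, 0.5))
```

The check reuses one precomputed checksum vector, so verifying a product costs a single dot product rather than a full recomputation. That is the sense in which ABFT is cheaper than redundant execution.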

“As supercomputers continue to become more complex and subject to power and energy restraints, soft errors will only grow as a factor of concern,” Song explained. “Using the Stampede supercomputer at the Texas Advanced Computing Center, we were able to evaluate our online ABFT designs, which showed only trivial overhead for various error scenarios and demonstrated tangible flexibility for detecting and recovering various types of soft errors in general iterative methods.”

‘False sharing’ can be positive

In multithreaded systems, computers “multitask” using as much of a single core as possible. Simultaneous multithreading (SMT) architectures, used in eight of the top 10 supercomputers in the world today, make that process more efficient. However, SMT threads risk contending with each other for shared resources, including functional units (instruction issue slots, load/store queues, and integer and floating-point units) and the entire memory subsystem (all levels of cache, prefetchers, and bandwidth), which can cause severe performance degradation.

To address such performance issues, Song, with Probir Roy and Xu Liu, both of the College of William and Mary, developed a first-of-its-kind scheme that introduces false sharing among SMT threads. The scheme exploits inter-thread locality to reduce memory contention and improve HPC application performance. They also created the SMTAnalyzer, a low-overhead performance tool that identifies SMT-aware optimization opportunities in multithreaded programs. Their method is described in the second HPDC’16-accepted full paper, “SMT-Aware Instantaneous Footprint Optimization.”
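The intuition can be illustrated with a toy model, which is a sketch under stated assumptions and not the authors' method or the SMTAnalyzer. It compares how many distinct cache lines two threads touch within a short window of steps when an array is split into contiguous blocks versus interleaved element by element. The 64-byte line, 8-byte elements, window length, and all function names are hypothetical.

```python
# Toy model of the short-term cache-line footprint of two threads sharing a
# cache. Sizes and names are assumptions for illustration only.

LINE_BYTES = 64   # assumed cache-line size
ELEM_BYTES = 8    # assumed element size (a double)

def line_of(index):
    """Cache line that holds array element `index`."""
    return (index * ELEM_BYTES) // LINE_BYTES

def block_schedule(n, threads):
    """Each thread walks its own contiguous chunk of the array."""
    chunk = n // threads
    return [list(range(t * chunk, (t + 1) * chunk)) for t in range(threads)]

def interleaved_schedule(n, threads):
    """Threads take alternating elements, so they share cache lines."""
    return [list(range(t, n, threads)) for t in range(threads)]

def footprint(schedule, window):
    """Largest number of distinct lines touched by all threads together
    within any `window` consecutive steps."""
    steps = len(schedule[0])
    worst = 0
    for start in range(steps - window + 1):
        lines = {line_of(i)
                 for thread in schedule
                 for i in thread[start:start + window]}
        worst = max(worst, len(lines))
    return worst

block_fp = footprint(block_schedule(64, 2), window=4)
inter_fp = footprint(interleaved_schedule(64, 2), window=4)
```

In this model the interleaved layout keeps both threads on the same lines at the same time, so their combined footprint in the shared cache is smaller. On separate cores the same layout would cause the usual false-sharing penalty, which is why the article's heading calls the effect positive only in the SMT setting.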

In their research, Song and his co-authors conducted systematic performance analyses, characterizing the performance impact of SMT on various benchmarks, including LULESH, IRSmk, Needle, SRAD, LU, Stencil, 3D Tensor, and StreamCluster 2. They then quantified memory-level contention in detail on Intel’s Xeon and Xeon Phi and IBM’s POWER7 SMT architectures. Their results showed that the SMT-aware optimization scheme, guided by the SMTAnalyzer, could improve the performance of these programs by drastically reducing memory contention.

“These papers are another exemplary addition to Leon’s expanding publications repertoire, and they highlight our group’s impressive collaborative research efforts,” Kerbyson added. “Continued acceptance in these distinguished international conferences by our staff reinforces that PNNL has a commitment to providing state-of-the-art computer science, designed to impact current and future HPC systems.”

HPDC’16 is a premier international conference that showcases the latest in design, implementation, evaluation, and use of parallel and distributed systems for high-end computing. The symposium program will feature distinguished keynote talks, technical paper presentations, and a poster session, along with several affiliated workshops. HPDC’16 is being held in Kyoto, Japan, from May 31 to June 4, 2016.

Acknowledgments: Research featured in these papers was supported by the U.S. Department of Energy (DOE), Office of Advanced Scientific Computing Research, specifically, the DOE’s Center for Exascale Simulation of Advanced Reactors (CESAR), Whole-Program Adaptive Detection and Mitigation (AEDAM), and Center for Advanced Technology Evaluation (CENATE) projects. Some experiments for the New-Sum paper were conducted on Stampede, a supercomputer housed within the Texas Advanced Computing Center, The University of Texas at Austin.

Research Teams:

New-Sum: Darren Kerbyson, Sriram Krishnamoorthy, and Shuaiwen Leon Song, PNNL; Zizhong Chen, Xin Liang, Dingwen Tao, and Panruo Wu, UC Riverside; Eddy Z. Zhang, Rutgers University.

SMT-Aware: Shuaiwen Leon Song, PNNL; Xu Liu and Probir Roy, College of William and Mary.

References:

  • Roy P, X Liu, and SL Song. 2016. “SMT-Aware Instantaneous Footprint Optimization.”
  • Tao D, SL Song, S Krishnamoorthy, P Wu, X Liang, EZ Zhang, D Kerbyson, and Z Chen. 2016. “New-Sum: A Novel Online ABFT Scheme For General Iterative Methods.”

Both papers will be featured at the 25th International Symposium on High-Performance Parallel and Distributed Computing (HPDC’16), May 31 to June 4, 2016, Kyoto, Japan.

