February 19, 2018
Feature

Taming Big Data Analytics Workloads

HPC scientists to showcase SHAD developer framework at upcoming IEEE/ACM CCGrid 2018 Conference

Vito Giovanni Castellana (left) and Marco Minutoli (right)

The unprecedented amount of rapidly changing data that needs to be processed in emerging data analytics applications poses novel computational challenges impacting both hardware and software. Options that require customizing architectures, software, or both to target specific problems mean long development times, difficult-to-achieve solutions, and limited flexibility. Computer scientists Vito Giovanni Castellana and Marco Minutoli, from PNNL’s High Performance Computing group, are among those seeking viable solutions to evolving big data problems. Recently, their work, documented in “SHAD: the Scalable High-performance Algorithms and Data-structures Library,” was accepted for inclusion in the main program of the upcoming 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, known as CCGrid 2018.

Built to aid application developers, SHAD provides scalability and performance and, unlike other high-performance data analytics frameworks, aims to support different application domains, including graph processing, machine learning, and data mining.

“There is a gap in current technologies between productivity, performance, and versatility,” Castellana said. “Developers of high performance data analytics software spend significant effort tuning and optimizing their solutions for best-in-class performance, while data scientists usually trade performance for higher productivity. SHAD wants to fill the gap by providing a unified environment for both classes of users.”

SHAD facilitates application development by providing a high-level shared-memory programming environment and general-purpose data structures with interfaces inspired by common programming language libraries. The data structures, such as Array, Vector, Map, and Set, are designed to accommodate high data volumes that can be accessed in massively parallel computing environments and used as building blocks for SHAD extensions, such as higher-level software libraries. Because both Castellana and Minutoli agree that open-source software is fundamental to engaging other scientists and advancing science and technology, SHAD is currently publicly available under an Apache license at https://github.com/pnnl/SHAD.
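For readers less familiar with the kind of library interface described above, here is a minimal single-machine sketch built only on the C++ standard library's vector, map, and set. It is not SHAD code and does not show SHAD's API. The names TokenStats and CountTokens are hypothetical, chosen purely for illustration.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical summary type for illustration; not part of SHAD.
struct TokenStats {
  std::map<std::string, std::size_t> counts;  // token -> number of occurrences
  std::set<std::string> unique;               // distinct tokens seen
};

// Walks a vector of tokens once, filling a map and a set.
// This runs sequentially on one machine; it only illustrates the familiar
// container-style interface, not distributed or parallel execution.
TokenStats CountTokens(const std::vector<std::string>& tokens) {
  TokenStats stats;
  for (const auto& token : tokens) {
    ++stats.counts[token];       // operator[] default-initializes missing keys to 0
    stats.unique.insert(token);  // duplicates are ignored by std::set
  }
  return stats;
}
```

Calling CountTokens on the tokens "graph", "map", "graph" produces a count of two for "graph", one for "map", and two distinct tokens. According to the article, SHAD aims to offer data structures with similarly familiar interfaces while accommodating data volumes beyond a single machine's reach.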

 

In their work to be presented at CCGrid 2018, Castellana and Minutoli evaluated SHAD’s flexibility using a cluster of 24 nodes, each equipped with two Intel Xeon E5-2680 v2 central processing units running at 2.8 GHz and 768 GB of memory. They scaled their experiments up to 320 cores. In direct comparisons on single-node machines and clusters, for example, against C++ standard libraries and in graph applications, SHAD demonstrated improved productivity with good performance and scalability. Moreover, when compared with custom solutions, SHAD provided similar performance with far less development effort.

“Scientists and engineers can use SHAD to quickly prototype their ideas and speed up the development of complex software systems,” Minutoli added. “While our goal is to deliver a productive and user-friendly environment, we are committed to providing the best trade-off between productivity and performance.”

CCGrid 2018 emphasizes research using and impacting cluster, cloud, and grid computing and is the primary international forum for showcasing results and technological developments. Some areas of interest include applications, architecture and networking, programming models and runtime systems, and performance modeling and evaluation. A truly global conference, CCGrid 2018 returns to the United States this year and is being held May 1-4, 2018, in Washington, D.C.

Funding:
This work was supported in part by the High Performance Data Analytics (HPDA) Program at Pacific Northwest National Laboratory.

Reference:

  • Castellana VG and M Minutoli. 2018. “SHAD: the Scalable High-performance Algorithms and Data-structures Library.” To be presented at the 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2018), May 01-04, 2018, Washington D.C. 

###

About PNNL

Pacific Northwest National Laboratory draws on its distinguishing strengths in chemistry, Earth sciences, biology and data science to advance scientific knowledge and address challenges in sustainable energy and national security. Founded in 1965, PNNL is operated by Battelle for the Department of Energy’s Office of Science, which is the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science. For more information on PNNL, visit PNNL's News Center. Follow us on Twitter, Facebook, LinkedIn and Instagram.